ABSTRACT


We  investigate  three  issues  related  to  the  transport  layer,  and  address  these 
issues  using  the  innovative  transport  layer  services  offered  by  the  Stream  Control 
Transmission  Protocol  (SCTP)  [RFC4960]. 

In the first issue, we explore the benefits of SCTP’s multistreaming
service for HTTP-based applications. TCP, the current web transport, offers a
sequential bytestream and in-order data delivery within the bytestream. Transferring
independent web objects over a single TCP connection results in head-of-line (HOL)
blocking and worsens web response times. In contrast, transferring these objects
over different SCTP streams eliminates inter-object HOL blocking. We propose a
design for HTTP over SCTP streams, and implement this design in the open source
Apache web server and Firefox browser. Using emulation, we show that persistent and
pipelined HTTP 1.1 transfers over a single multistreamed SCTP association improve
web response times when compared to similar transfers over a single TCP connection.
The difference between TCP and SCTP response times increases and is more visually
perceivable in high latency and lossy browsing conditions, such as those found in the
developing world.

The  current  workaround  to  improve  an  end  user’s  perceived  WWW 
performance  is  to  download  an  HTTP  transfer  over  multiple  TCP  connections.  While 
we  expect  multiple  TCP  connections  to  improve  HTTP  throughput,  emulation  results 
show that the competing and bursty nature of multiple TCP senders degrades HTTP
performance, especially in end-to-end paths with low bandwidth last hops. In such
browsing conditions, a single multistreamed SCTP association not only eliminates HOL
blocking, but also boosts throughput compared to multiple TCP connections.

In the second issue, we explore how SCTP’s (or TCP’s) SACK mechanism
degrades end-to-end performance when out-of-order data is non-renegable. Using
simulation,  we  show  that  SACKs  result  in  inevitable  send  buffer  wastage,  which 
increases  as  the  frequency  of  loss  events  and  loss  recovery  durations  increase.  We 
introduce  a  fundamentally  new  ack  mechanism,  Non-Renegable  Selective 
Acknowledgments  (NR-SACKs),  for  SCTP.  An  SCTP  receiver  uses  NR-SACKs  to 
explicitly  identify  some  or  all  out-of-order  data  as  being  non-renegable,  allowing  a 
sender  to  free  up  send  buffer  sooner  than  if  the  data  were  only  SACKed.  Simulation 
comparisons  show  that  NR-SACKs  enable  more  efficient  utilization  of  a  transport 
sender’s  memory,  and  also  improve  throughput  in  Concurrent  Multipath  Transfer 
(CMT)  [Iyengar  2006]. 

The  third  issue  explores  CMT  performance  during  path  failures.  Using 
simulation,  we  demonstrate  how  CMT  suffers  from  significant  “rbuf  blocking”  which 
degrades  performance  during  permanent  and  short-term  path  failures.  To  improve 
performance,  we  introduce  a  new  destination  state  called  the  “Potentially  Failed”  (PF) 
state.  CMT’s  failure  detection  and  (re)transmission  policies  are  augmented  to  include 
the  PF  state,  and  the  modified  CMT  is  called  CMT-PF.  Using  simulation,  we 
demonstrate  that  CMT-PF  outperforms  CMT  during  failures  -  even  under  aggressive 
failure detection thresholds. We also show that CMT-PF performs on par with or better
than CMT during non-failure scenarios. In light of these findings, we
recommend  CMT  be  replaced  by  CMT-PF  in  existing  and  future  CMT  implementations 
and  RFCs. 

Chapter 1
INTRODUCTION 


1.1  Dissertation  Scope 

This  dissertation  investigates  three  issues  related  to  the  transport  layer,  and 
addresses  these  issues  to  improve  application  performance.  While  these  issues  are 
explored  using  the  Stream  Control  Transmission  Protocol  (SCTP)  [RFC4960], 
different  subsets  of  the  proposed  ideas  and  performance  conclusions  would  be 
applicable  to  any  reliable  transport  that  provides  services  similar  to  SCTP.  The  rest  of 
this  section  outlines  the  three  issues. 

1.1.1  Issue  (1):  Web  over  Multistreamed  Transport 

Transport  layer  multistreaming  is  the  ability  of  a  transport  protocol  to 
support  multiple  streams,  where  each  stream  is  a  logical  data  flow  with  its  own 
sequencing  space.  Within  each  stream,  the  transport  receiver  delivers  data  in-sequence 
to  the  application,  without  regard  to  the  relative  order  of  data  arriving  on  other 
streams.  This  property  makes  streams  ideal  for  transferring  independent  web  objects. 
When  each  web  object  is  transmitted  on  a  different  stream,  the  processing  and  display 
of  one  object  does  not  depend  on  the  successful  transfer  and  delivery  of  other  object(s). 

TCP, the current web transport, does not support transport layer
multistreaming. At the time TCP was designed, congestion and flow control were the
crucial transport layer services required by network applications. Later, when HTTP’s
design required a reliable transport protocol, TCP was the only available option and
was ‘chosen’ for HTTP transfers. However, transferring independent web objects over
TCP results in sub-optimal response times, since a TCP connection (i) offers a single
sequential bytestream to the application, and (ii) provides in-order delivery within the
bytestream: if a piece of one web object is lost in the network, successively
transmitted web objects will not be delivered to the client until the lost piece is
retransmitted and received.

Though  it  is  believed  that  transport  layer  streams  can  improve  web 
response  times  [Gettys  2002],  no  experimentation  or  analysis  exists  to  support  this 
hypothesis.  This  dissertation  provides  some  of  the  analysis.  When  we  started  working 
on  this  issue,  SCTP  was  the  only  transport  that  supported  multistreaming.  Hence,  this 
dissertation considers SCTP streams for HTTP transfers. More recently, [Ford 2007]
proposed the Structured Stream Transport (SST) protocol, which functions similarly to
SCTP (discussed in Chapter 2).

1.1.2  Issue  (2):  Reneging  and  Selective  Acks 

Reliable  transport  protocols  (such  as  TCP  and  SCTP)  employ  two  kinds  of 
data  acknowledgment  mechanisms:  (i)  cumulative  acks  indicate  data  that  has  been 
received  in-sequence,  and  (ii)  selective  acks  (SACKs)  indicate  data  that  has  been 
received out-of-order. While cum-acked data is a receiver’s responsibility, SACKed
data is not. SACKed out-of-order data is implicitly renegable, that is, a receiver may
SACK data and later discard it. The possibility of reneging, however remote, forces a
transport  sender  to  maintain  copies  of  SACKed  data  in  the  send  buffer  until  they  are 
cum-acked. 

Data that has been delivered to the application is, by definition, non-renegable
by the transport receiver. Unlike TCP, which never delivers out-of-order data
to the application, SCTP’s multistreaming and unordered data delivery services result
in out-of-order data being delivered to the application and thus becoming
non-renegable. Interestingly, TCP and SCTP implementations can be configured such that
the receiver is not allowed to renege, and therefore never reneges, on out-of-order data.

This  dissertation  investigates  the  negative  effects  of  the  SACK  mechanism 
when  out-of-order  data  is  non-renegable.  While  non-renegable  out-of-order  data  is 
possible  in  both  TCP  and  SCTP,  note  that  the  possibility  is  innate  to  SCTP  due  to 
SCTP’s  out-of-order  data  delivery  services.  Therefore,  our  investigations  focus  on 
SCTP. 
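
To make the send buffer cost concrete, the following minimal Python sketch models a sender that must keep SACKed chunks until they are cum-acked, and compares it with a sender whose receiver reports the same out-of-order chunks as non-renegable (the NR-SACK idea introduced in the abstract). The class and method names are hypothetical, and the model ignores retransmissions, timers, and chunk sizes.

```python
class SendBuffer:
    """Toy send buffer keyed by TSN; an illustration, not an SCTP implementation."""

    def __init__(self, tsns):
        self.outstanding = set(tsns)  # chunks the sender still holds copies of

    def on_cum_ack(self, cum_tsn):
        # Cum-acked data is the receiver's responsibility, so it can be freed.
        self.outstanding = {t for t in self.outstanding if t > cum_tsn}

    def on_sack(self, sacked_tsns):
        # SACKed data is renegable: the sender must keep its copies.
        pass

    def on_nr_sack(self, nr_sacked_tsns):
        # Non-renegable data will not be discarded by the receiver: free it.
        self.outstanding -= set(nr_sacked_tsns)


def buffered_after_loss(use_nr_sack):
    """Chunks still buffered after TSN 3 is lost and TSNs 4..10 arrive."""
    buf = SendBuffer(range(1, 11))  # TSNs 1..10 have been sent
    buf.on_cum_ack(2)               # TSNs 1 and 2 arrived in sequence
    gap_block = range(4, 11)        # TSNs 4..10 arrived out of order
    if use_nr_sack:
        buf.on_nr_sack(gap_block)
    else:
        buf.on_sack(gap_block)
    return len(buf.outstanding)


print(buffered_after_loss(use_nr_sack=False))  # 8 chunks held
print(buffered_after_loss(use_nr_sack=True))   # 1 chunk held
```

In this toy scenario the SACK-only sender keeps eight chunks buffered while waiting for TSN 3 to be recovered, whereas the NR-SACK sender keeps only the lost chunk.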

1.1.3  Issue  (3):  CMT  during  Path  Failures 

A host is multihomed if it can be addressed by multiple IP addresses
[RFC1122], as is the case when the host has multiple network interfaces. Multiple
active  interfaces  also  suggest  the  simultaneous  existence  of  multiple  paths  between  the 
multihomed  hosts.  CMT  [Iyengar  2006]  exploits  these  multiple  paths  for  simultaneous 
transfer  of  new  data  between  end  hosts,  and  increases  a  network  application’s 
throughput.  [Iyengar  2006]  evaluated  CMT  over  paths  with  asymmetric  delay  and  loss 
characteristics.  But  [Iyengar  2006]  did  not  consider  path  failures,  which  is  the  scope  of 
our  work. 

Both TCP and UDP are unaware of multihoming. Hence, [Iyengar 2006]
used SCTP, a multihoming-aware transport protocol, to perform CMT at the
transport layer. Since this research is a continuation of [Iyengar 2006], our
investigations also use SCTP. Incidentally, SCTP also supports path failure detection.

1.2 An SCTP Primer


SCTP  was  originally  developed  to  carry  telephony  signaling  messages  over 
IP  networks.  With  continued  work,  SCTP  evolved  into  a  general  purpose  transport 
protocol  with  advanced  delivery  options  [RFC4960].  Similar  to  TCP,  SCTP  provides  a 
reliable,  full-duplex,  congestion  and  flow-controlled  connection,  called  an  association. 
An  SCTP  packet,  or  more  generally,  protocol  data  unit  (PDU),  consists  of  one  or  more 
concatenated  building  blocks  called  chunks:  either  control  or  data.  For  the  purposes  of 
reliability  and  congestion  control,  each  data  chunk  in  an  association  is  assigned  a 
unique  Transmission  Sequence  Number  (TSN).  Since  chunks  are  atomic,  TSNs  are 
associated  with  chunks  of  data,  as  opposed  to  TCP  which  associates  a  sequence 
number  with  each  data  octet  in  the  bytestream. 

Unlike  TCP,  SCTP  offers  innovative  transport  layer  services  such  as 
multihoming  and  multistreaming. 

1.2.1  SCTP  Multistreaming 

An  SCTP  stream  is  a  unidirectional  data  flow  within  an  SCTP  association. 
Independent  application  objects  can  be  transmitted  in  different  streams  to  maintain  their 
logical  separation  during  transfer  and  delivery.  All  SCTP  streams  within  an  association 
are  subject  to  shared  congestion  control,  and  thus  SCTP’s  multistreaming  adheres  to 
TCP’s  fairness  principles. 

Figure  1.1  illustrates  a  multistreamed  association  between  hosts  A  and  B. 
In  this  example,  host  A  uses  three  output  streams  to  host  B  (numbered  0  to  2),  and  has 
only  one  input  stream  from  host  B  (numbered  0).  The  number  of  input  and  output 
streams  in  an  SCTP  association  is  negotiated  during  association  setup.  SCTP  uses 
Stream Sequence Numbers (SSNs) to preserve data order within each stream.
However, maintaining order of delivery between transport protocol data units (TPDUs)
transmitted  on  different  streams  is  not  a  constraint.  That  is,  data  arriving  in-order 
within  an  SCTP  stream  is  delivered  to  an  application  without  regard  to  data  arriving  on 
other  streams. 
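
The per-stream delivery rule can be sketched as follows. This is a minimal illustration that assumes SSNs start at 0 on every stream; the `StreamReceiver` class is hypothetical and omits TSNs, acknowledgments, and unordered delivery.

```python
from collections import defaultdict


class StreamReceiver:
    """Toy receiver: in-order delivery within a stream, no ordering across streams."""

    def __init__(self):
        self.next_ssn = defaultdict(int)  # next SSN expected on each stream
        self.pending = defaultdict(dict)  # stream -> {ssn: data} held for reordering
        self.delivered = []               # (stream, data) in delivery order

    def receive(self, stream, ssn, data):
        self.pending[stream][ssn] = data
        # Deliver every consecutive message now available on this stream only.
        while self.next_ssn[stream] in self.pending[stream]:
            msg = self.pending[stream].pop(self.next_ssn[stream])
            self.delivered.append((stream, msg))
            self.next_ssn[stream] += 1


rx = StreamReceiver()
rx.receive(0, 1, "a1")  # stream 0 is missing SSN 0, so a1 is held back
rx.receive(1, 0, "b0")  # stream 1 is unaffected and delivers immediately
rx.receive(0, 0, "a0")  # the gap on stream 0 fills; a0 and a1 are delivered
print(rx.delivered)     # [(1, 'b0'), (0, 'a0'), (0, 'a1')]
```

The missing message on stream 0 delays only stream 0; the message on stream 1 reaches the application right away.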


Figure 1.1: Multistreamed Association between Hosts A and B

1.2.2  SCTP  Multihoming 

To  benefit  from  network  interface  redundancy  and  provide  end-to-end 
network  fault  tolerance,  SCTP  supports  multihoming  at  the  transport  layer.  An  SCTP 
endpoint  may  bind  to  multiple  IP  addresses  during  association  initialization.  Referring 
to  Figure  1.2,  let  us  contrast  SCTP  with  TCP  to  further  explain  SCTP’s  multihoming 
feature. Four distinct TCP connections are possible between Hosts A and B: (A1,B1),
(A1,B2), (A2,B1), (A2,B2). SCTP, on the other hand, is not forced to choose a single IP
address on each host. Instead, a single SCTP association could consist of two sets of IP
addresses, which in our example would be: ({A1,A2}, {B1,B2}). Each endpoint chooses
a single destination address as a primary destination address, which is used for
transmission of new data. Note that a single port number is used at each endpoint
regardless of the number of IP addresses.


SCTP monitors the reachability of each destination address through two
mechanisms: acks of data and periodic probes known as heartbeats. Failure in reaching
the primary destination results in failover, where an SCTP endpoint dynamically
chooses an alternate destination to transmit the data, until the primary destination
becomes reachable again.
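
The failover behavior can be sketched roughly as below. The sketch assumes a per-destination error counter that is incremented on each data or heartbeat timeout and cleared on any ack, with a destination considered unreachable once the counter exceeds a threshold; the threshold of 5 mirrors the Path.Max.Retrans default in RFC 4960, and the `Destination` class and `pick_destination` helper are hypothetical names.

```python
PATH_MAX_RETRANS = 5  # assumed threshold (RFC 4960 default for Path.Max.Retrans)


class Destination:
    def __init__(self, addr):
        self.addr = addr
        self.errors = 0  # consecutive timeouts observed for this address

    @property
    def active(self):
        return self.errors <= PATH_MAX_RETRANS

    def on_timeout(self):
        # A data retransmission timeout or an unanswered heartbeat.
        self.errors += 1

    def on_ack(self):
        # A data ack or heartbeat ack shows the address is reachable.
        self.errors = 0


def pick_destination(primary, alternates):
    """Use the primary while it is reachable, else the first reachable alternate."""
    if primary.active:
        return primary
    for dest in alternates:
        if dest.active:
            return dest
    return primary  # nothing else is reachable; keep trying the primary


b1, b2 = Destination("B1"), Destination("B2")
for _ in range(PATH_MAX_RETRANS + 1):
    b1.on_timeout()
print(pick_destination(b1, [b2]).addr)  # B2 (failover)
b1.on_ack()
print(pick_destination(b1, [b2]).addr)  # B1 (primary is reachable again)
```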

1.2.3  Concurrent  Multipath  Transfer 

Multihoming among networked machines and devices is a technologically
feasible and increasingly economical proposition. Multihomed nodes may be
simultaneously connected through multiple end-to-end paths to increase resilience to
path failure. For instance, users may be simultaneously connected through
dial-up/broadband, or via multiple wireless technologies such as 802.11b and GPRS.
Concurrent Multipath Transfer (CMT) [Iyengar 2006] is an experimental extension to
SCTP that assumes multiple independent paths, and exploits these paths for
simultaneous transfer of new data between end hosts. A naive version of CMT, where
a data sender simply transfers new data over multiple paths, increases data reordering
and adversely affects performance. [Iyengar 2006] investigates these negative effects
and proposes algorithms and retransmission policies that improve application
throughput.

1.3  Dissertation  Overview 

A  structural  overview  of  the  dissertation  is  shown  in  Figure  1.3.  The  three 
issues  are  discussed  in  Chapters  2,  3  and  4,  respectively.  The  references  cited  for  each 
chapter  represent  the  author’s  publications  for  each  topic. 

Chapter  2  presents  our  work  on  the  first  issue  -  web  over  multistreamed 
SCTP.  The  chapter  proposes  a  design  for  HTTP  over  SCTP  streams,  and  discusses  our 
efforts  to  implement  the  design  in  the  popular  Apache  web  server  and  Firefox  browser. 
Using emulation, we show that persistent and pipelined HTTP 1.1 transfers over a
single multistreamed SCTP association improve web response times when compared
to similar transfers over a single TCP connection. The difference between TCP and SCTP
response times increases and is more visually perceivable in high latency and lossy
browsing conditions, such as those found in the developing world.

The  current  workaround  to  improve  an  end  user’s  perceived  WWW 
performance  is  to  download  an  HTTP  transfer  over  multiple  TCP  connections.  While 
we expect multiple TCP connections to improve HTTP throughput, emulation results
show that the competing and bursty nature of multiple TCP senders degrades HTTP
performance, especially in end-to-end paths with low bandwidth last hops. In such
browsing conditions, a single multistreamed SCTP association not only eliminates HOL
blocking, but also boosts throughput compared to multiple TCP connections. These
experiments  were  performed  as  part  of  this  author’s  summer  2008  internship  at  Cisco 
Systems. 

Our body of work in HTTP over SCTP has triggered significant interest in
the area. The Protocol Engineering Lab has secured additional funding from Cisco
Systems to pursue some of the ongoing and future work discussed in Chapter 2.

Figure 1.3: Dissertation Structure


Chapter 3 discusses the second issue - how the existing SACK mechanism
degrades end-to-end performance when out-of-order data is non-renegable. Using
simulation,  we  show  that  SACKs  result  in  inevitable  send  buffer  wastage,  which 
increases  as  the  frequency  of  loss  events  and  loss  recovery  durations  increase.  We 
introduce  a  fundamentally  new  ack  mechanism,  Non-Renegable  Selective 
Acknowledgments  (NR-SACKs),  for  SCTP.  An  SCTP  receiver  uses  NR-SACKs  to 
explicitly  identify  some  or  all  out-of-order  data  as  being  non-renegable,  allowing  a 
sender to free up send buffer sooner than if the data were only SACKed. Simulation
comparisons show that NR-SACKs enable more efficient utilization of a transport
sender’s  memory.  Further  investigations  show  that  NR-SACKs  also  improve 
throughput  in  CMT.  The  final  section  of  Chapter  3  discusses  ongoing  activity, 
including  our  efforts  within  the  IETF  to  standardize  NR-SACKs  for  SCTP,  and  at  UD 
to  implement  NR-SACKs  in  FreeBSD  SCTP. 

Chapter  4  presents  our  work  on  the  third  issue  -  CMT  performance  during 
path  failures.  Using  simulation,  we  demonstrate  how  CMT  suffers  from  significant 
“rbuf  blocking”  which  degrades  performance  during  permanent  and  short-term  path 
failures.  To  improve  performance,  we  introduce  a  new  destination  state  called  the 
“Potentially  Failed”  (PF)  state.  CMT’s  failure  detection  and  (re)transmission  policies 
are  augmented  to  include  the  PF  state,  and  the  modified  CMT  is  called  CMT-PF.  Using 
simulation,  we  demonstrate  that  CMT-PF  outperforms  CMT  during  failures  -  even 
under aggressive failure detection thresholds. We also show that CMT-PF performs on
par with or better than CMT during non-failure scenarios. In light of these
findings, we recommend CMT be replaced by CMT-PF in existing and future CMT
implementations and RFCs. Chapter 4 finishes with a discussion of our ongoing effort
to implement CMT-PF in FreeBSD SCTP.

Finally, Chapter 5 summarizes our contributions, and concludes this
dissertation.

Chapter 2

HTTP OVER MULTISTREAMED TRANSPORT

This chapter discusses the first problem - HTTP over SCTP streams.
Sections 2.1 and 2.2 explain the head-of-line (HOL) blocking problem and its negative
consequences in HTTP over TCP. Section 2.3 describes our design of HTTP over
multistreamed SCTP. Sections 2.4 and 2.5 discuss HTTP over SCTP implementation
specifics in the Apache web server and Firefox web browser, respectively. Section 2.6
explains evaluation preliminaries and Sections 2.7 and 2.8 present results. Section 2.9
concludes and presents ongoing and future work. Section 2.10 discusses related work.

2.1  Introduction 

HTTP [RFC2616] requires a reliable transport protocol for end-to-end
communication.  While  historically  TCP  has  been  used  for  this  purpose,  HTTP  does  not 
require  TCP.  A  TCP  connection  offers  a  single  sequential  bytestream  to  a  web  server. 
In  the  case  of  HTTP  1.1  with  persistence  and  pipelining,  the  independent  HTTP 
responses  are  serialized  and  sent  sequentially  over  a  single  connection  (i.e.,  one  TCP 
bytestream).  In  addition,  a  TCP  connection  provides  in-order  delivery  within  the 
bytestream: if a TPDU containing HTTP response i is lost in the network, successive
TPDUs containing HTTP responses i+n (n ≥ 1) will not be delivered to the web client
until  the  lost  TPDU  is  retransmitted  and  received.  This  situation,  known  as  head-of- 
line  (HOL)  blocking,  occurs  because  TCP  cannot  logically  separate  independent  HTTP 
responses  in  its  transport  and  delivery  mechanisms. 

Transport layer multistreaming is the ability of a transport protocol to
support multiple streams, where each stream is a logical data flow with its own
sequencing space. Within each stream, the transport receiver delivers data in-sequence
to the application, without regard to the relative order of data arriving on other
streams. SCTP [RFC4960] is a standardized reliable transport protocol which provides
multistreaming. Independent HTTP responses transmitted over different streams of an
SCTP association can be delivered to the web browser without HOL blocking.

While  most  web  users  in  developed  nations  experience  excellent  browsing 
conditions,  a  large  and  growing  portion  of  WWW  users  in  developing  nations 
experience high end-to-end delays and loss rates. In such network conditions, persistent
and pipelined HTTP 1.1 transfers over TCP suffer from exacerbated HOL blocking,
resulting in poor browsing experience (discussed in the next section). In this work, we
evaluate multistreamed web transport’s ability to reduce HOL blocking and improve a
web user’s browsing experience in developing regions.

2.2  Head-of-line  Blocking 

This section introduces a model for persistent and pipelined HTTP 1.1
transfer to formulate head-of-line (HOL) blocking. This section also discusses various
factors that aggravate HOL blocking.

2.2.1  Model  for  HTTP  1.1,  and  HOL  Blocking 

We consider the following model to understand HOL blocking in an HTTP
1.1 persistent, pipelined transfer containing N embedded objects (Figure 2.1).

[Figure 2.1: diagram not recovered; its columns are labeled User Screen, Web Client, Transport, and Web Server]

$obj_i$ = object $i$, $0 \le i \le N$. $obj_0$ denotes index.html, and $obj_1, \ldots, obj_N$
denote the N embedded objects in index.html.

$req_i$ = time when the web client generates the HTTP GET request for $obj_i$,
and writes the request to the transport layer.

$obj_i^k$ = $k$th piece of $obj_i$, $0 \le k \le M$; $obj_i^0$ denotes the response header, and
$obj_i^1, \ldots, obj_i^M$ denote the different pieces of $obj_i$. Note that M depends on the
size of $obj_i$. In our emulations, we assume all objects are the same size (M).

$rsp_i^k$ = time when transport delivers $obj_i^k$ to the web client.
$ren_i^k$ = time when web client renders $obj_i^k$ on user’s monitor.
$proc_i^k$ = ($ren_i^k$ - $rsp_i^k$) denotes the web client’s processing time (e.g.,
decoding, decompression, rendering) for $obj_i^k$.

In HTTP over TCP, if $obj_i^k$ is lost and recovered after x time units, pieces
of $obj_j$ (j > i) could be HOL blocked for x time units. Assuming the web client is
currently rendering $obj_i^{k-1}$, if (x < $proc_i^{k-1}$), this instance of HOL blocking does not
affect response time for $obj_{i+1}$. Otherwise, the HOL blocking increases $obj_{i+1}$’s
response time by (x - $proc_i^{k-1}$) time units [Diot 1999]. Thus, the duration of HOL
blocking depends on the loss recovery period, x.

In both TCP and SCTP, loss recovery based on retransmission after 3 duplicate
acks (fast retransmit) takes ~1 round-trip time (RTT), and loss recovery based on
retransmission after timeout expiration (timeout retransmit) takes between the
initial retransmission timeout value (RTO) of 3 seconds and the maximum of (1 RTT,
minimum RTO of 1 second) [RFC2988]. Note that the loss recovery period increases as the
path’s RTT increases. Also, the frequency of HOL blocking increases as the loss rate
on the end-to-end path increases. Intuitively, HOL blocking would be exacerbated over
a high RTT, lossy path.
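
For reference, the timeout bounds quoted above come from the RTO computation in RFC 2988. The sketch below follows that computation in simplified form: it ignores the clock granularity term, the maximum RTO, and exponential backoff, and the `RtoEstimator` class name is hypothetical.

```python
ALPHA, BETA = 1 / 8, 1 / 4  # smoothing gains specified in RFC 2988
INITIAL_RTO = 3.0           # seconds, used before any RTT measurement
MIN_RTO = 1.0               # seconds, lower bound on the computed RTO


class RtoEstimator:
    def __init__(self):
        self.srtt = None    # smoothed RTT
        self.rttvar = None  # RTT variation
        self.rto = INITIAL_RTO

    def on_rtt_sample(self, r):
        """Update the estimator with RTT measurement r (seconds); return the new RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt, self.rttvar = r, r / 2
        else:
            # RTTVAR is updated before SRTT, using the old SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        self.rto = max(MIN_RTO, self.srtt + 4 * self.rttvar)
        return self.rto


est = RtoEstimator()
print(est.rto)                 # 3.0 before any measurement
print(est.on_rtt_sample(0.5))  # 1.5 after a first sample of 500 ms
```

On a low RTT path the 1 second minimum dominates, while on a high RTT path the RTO grows with the measured RTT, which is why timeout-based recovery lengthens HOL blocking on such paths.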

Apart from end-to-end path characteristics, individual object sizes also
influence the degree of HOL blocking. As object size increases, the probability that a
piece of the object is lost also increases. Hence, a large object in a pipelined transfer is
more likely to block delivery of subsequent objects than a smaller object would.
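
A rough way to quantify this effect, under the simplifying assumption (not made in the text) that each piece is lost independently with probability p, is that an object of m pieces loses at least one piece with probability 1 - (1 - p)^m. The function name and the sample values below are illustrative only.

```python
def p_object_loses_piece(p, m):
    """Probability that at least one of m pieces is lost.

    Assumes independent per-piece loss with probability p, which real paths
    with bursty loss do not strictly satisfy.
    """
    return 1 - (1 - p) ** m


for m in (1, 5, 20):
    print(m, round(p_object_loses_piece(0.03, m), 3))
```

The probability rises monotonically with m, consistent with the observation that larger objects are more likely to block the objects pipelined behind them.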

2.2.2  Browsing  Conditions  in  Developing  Regions 

Unlike web users in developed nations, a large and growing portion of
WWW users in developing regions experience Internet delays ranging from hundreds of
milliseconds to a few seconds. Such high delays stem from low bandwidth and/or
high propagation delay last hops, such as VSAT/3G/GPRS links.


Due to a multitude of factors, VSAT solutions (Figure 2.2) are the most
cost-effective and efficient method of providing Internet connectivity for commercial
customers, governments and consumers in developing nations and other areas where a
land-based infrastructure does not exist [WiderNet, CAfrica, Tarahaat,
VSAT-systems]. The successful deployment of VSAT systems and services in more than 120
countries provides communities with access to information, knowledge, education and
business opportunities, and has been crucial in the communities’ socio-economic
development [Rahman 2002].

The propagation delay from ground station to geostationary satellite to
ground station is ~280ms [Gurtov 2004, RFC2760]. Therefore, the delay over a VSAT
link increases the RTT by ~560ms. The bandwidth-limited VSAT link is most likely the
bottleneck in the transmission path. Any resulting queuing and/or processing delays
within the satellite further increase the RTT. The delay caused by shared channel access
over a VSAT link can sometimes increase the RTT on the order of a few seconds
[RFC3135].

GPRS and 3G links are characterized by variable and high latencies; the
RTTs in such networks can vary from a few hundred milliseconds to 1 second
[Chakravorty 2002, Chan 2002, RFC3481]. The proliferation of mobile phones in
developing regions, and the increasing use of web browsers and other web applications
on mobile phones, is another example of web transfers over high latency paths. High
Speed Downlink Packet Access (HSDPA) technology is the successor to 3G, and is
emerging from research to deployment. HSDPA offers improved broadband Internet
access (~1 Mbps per user per cell), and is targeted as a viable option for regular Internet
connectivity to both residential and mobile customers. However, channel access and/or
propagation delay on an HSDPA link adds ~80ms to the path RTT [Jurvansuu 2007],
which is significantly higher than current wired last hop delays.

In addition to propagation delays, sub-optimal traffic routing increases
latency of Internet traffic in developing nations [Baggaley 2007, Cottrell 2006]. For
example, sub-optimal routing for intra-African traffic results in Internet traffic
traversing multiple VSAT links, and/or being routed through North America or
Europe, leading to RTTs as high as 2.5 seconds [PingER]. Furthermore, Internet traffic
to/from developing regions traverses lossy paths, and experiences significant
end-to-end loss rates [Cottrell 2006, PingER].

Online  U.S.  shoppers  consider  4  seconds  as  the  maximum  acceptable  page 
download  time  before  potentially  abandoning  a  retail  site  [Akamai  2006].  Response 
times  above  4  seconds  interrupt  the  user  experience,  causing  the  user  to  leave  the  site 
or system. While web users over high latency and lossy paths in developing nations
must be more tolerant of long response times, these users will prefer a system that
provides a better browsing experience.

2.3  Design  of  HTTP  over  SCTP  Streams 

Several  experts  agree  that  the  best  transport  scheme  for  HTTP  would  be 
one  that  supports  datagrams,  provides  TCP  compatible  congestion  control  on  the  entire 
datagram  flow,  and  facilitates  concurrency  in  GET  requests  [Gettys  2002].  When  we 
started  this  work,  SCTP  was  the  only  available  multistreamed  transport,  and  hence 
became our default choice [Natarajan 2006a]. Afterward, [Ford 2007] proposed a new
TCP-based  multistreamed  web  transport.  This  new  transport  protocol  is  similar  to 
SCTP  and  is  discussed  in  Section  2.10. 

Apart from multistreaming, SCTP offers other features that are well suited
for a web transport. Unlike TCP, SCTP’s state transition does not require a
TIME_WAIT state [RFC793], since the Initiation and Verification tags help to
associate SCTP PDUs with the corresponding SCTP associations [RFC4960]. Note
that TCP’s TIME_WAIT state increases memory and processing load at a busy
web server [Faber 1999]. Also, SCTP’s COOKIE mechanism prevents SYN attacks,
and SCTP multihoming provides fault-tolerance and the possibility of multipath transfer
[Natarajan 2006a].

Two  guidelines  governed  our  HTTP  over  SCTP  design: 

• Make no changes to the existing HTTP specification, to reduce deployment
concerns.

• Minimize SCTP-related state information at the server so that SCTP
multistreaming does not further contribute to the server being a performance
bottleneck.

The independent nature of HTTP responses is best exploited by
downloading them on different SCTP streams. Accordingly, the important design
question to address was: which end (client or server) should decide the SCTP stream to
be used for an HTTP response? Having the web server manage the SCTP stream
scheduling is undesirable, as it involves maintaining additional state information at the
server. Further, the client is better positioned to make scheduling decisions that rely on
user perception and the operating environment. We therefore concluded that the client
should decide object scheduling on streams.

We considered two designs by which the client conveys the selected SCTP
stream to the web server: (1) the client specifies an SCTP stream number in the HTTP
GET request and the server sends the corresponding response on this stream, or (2) the
server transmits the HTTP response on the same stream number on which the
corresponding HTTP request was received. Design (1) requires just one incoming
stream and several outgoing streams at the server, but requires modifications to the
HTTP GET request specification. Design (2) requires the server to maintain as many
incoming streams as there are outgoing streams, increasing the memory overhead at the
server. Every inbound or outbound stream requires additional memory in the SCTP
Protocol Control Block (PCB), and the amount of memory required varies with the
SCTP implementation. The reference SCTP implementation on FreeBSD (version 6.1)
requires 25 bytes for every inbound stream and 33 bytes for every outbound stream
[FreeBSD]. We considered this memory overhead per stream to be insignificant
compared to the effort to modify the HTTP specification, and chose design (2).


Figure 2.3: Design of HTTP over SCTP Streams (Web Client and Web Server;
IS: Input Streams, OS: Output Streams)


Figure 2.3 gives an overview of our HTTP over SCTP design. A web
client and server first negotiate the number of SCTP streams to use for the web
transfer. During association establishment, the web client requests m inbound and m
outbound streams. The INIT-ACK from the server carries the web server’s offer on the
number of inbound/outbound streams (n). After association establishment, the number
of inbound streams and the number of outbound streams available for HTTP
transactions is s = MIN(m, n). Note that an SCTP end point can initially offer a lower
number of streams and later increase the offer using the streams reset functionality
[Stewart 2008a].

When a web server receives a request on an inbound SCTP stream a (a < s),
the server sends the corresponding response on the outbound stream a. If s is
smaller than the number of pipelined requests, the web client must schedule the
requests over the available SCTP streams using a scheduling policy, such as round-robin.
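
The negotiation and round-robin scheduling described above can be illustrated with a minimal Python sketch. The function names (negotiated_streams, schedule_round_robin) and the request paths are hypothetical and chosen only for illustration; this is not code from the Apache or Firefox adaptations.

```python
from itertools import cycle


def negotiated_streams(m: int, n: int) -> int:
    """Streams usable in each direction after establishment: s = MIN(m, n)."""
    return min(m, n)


def schedule_round_robin(requests, s):
    """Assign each pipelined request to a stream number in 0..s-1, round-robin.

    Returns a list of (stream_number, request) pairs in transmission order.
    """
    if s < 1:
        raise ValueError("at least one stream is required")
    streams = cycle(range(s))
    return [(next(streams), req) for req in requests]


if __name__ == "__main__":
    # Hypothetical values: the client requests 10 streams, the server offers 4.
    s = negotiated_streams(m=10, n=4)
    for stream, req in schedule_round_robin(["/a.png", "/b.png", "/c.png"], s):
        # Under design (2), the server replies on the stream the request arrived on.
        print(f"GET {req} on stream {stream}")
```

Because the server simply echoes the stream number, all scheduling state stays at the client, which matches the second design guideline.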

2.4 Implementation in the Apache Web Server

We  chose  the  popular  open  source  Apache  web  server  (version  2.0.55) 
[Apache]  for  our  task.  In  this  section,  we  give  an  overview  of  Apache’s  architecture, 
and  our  adaptations  [Natarajan  2006a]. 

2.4.1  Apache  Architecture 

The  Apache  HTTP  server  (httpd)  has  a  modular  architecture.  The  main 
functions  such  as  server  initialization,  HTTP  request  parsing,  and  memory  management 
are handled by the core module. Accessory functions such as request redirection,
authentication, and dynamic content handling are performed by separate modules. The core
module  relies  on  Apache  Portable  Runtime  (APR),  a  platform  independent  API,  for 
network,  memory  and  other  system  dependent  functions. 

Apache uses filters: functions through which different modules process
an incoming HTTP request (input filters) or an outgoing HTTP response (output
filters). The core module’s input filter calls APR’s read API to read HTTP requests.
During request processing, all state information related to the request is maintained in
a request structure. Once the response is generated, the core module’s output filter
calls APR’s send API for transmitting the response.

Apache has a set of multi-processing architectures that can be enabled at
compile time. We considered the following architectures: (1) prefork, a non-threaded
pre-forking server, and (2) worker, a hybrid multi-threaded multi-processing server.
With prefork, a configurable number of processes are forked during server
initialization, and are set up to listen for connections from clients. With worker, a
configurable number of server threads and a listener thread are created per process.
The listener thread listens for incoming connections from clients, and passes the
connection to a server thread for further processing. In both architectures, the server
processes or threads handle requests sequentially from a transport connection.

2.4.2  Adapting  Apache 

Apache’s core module and the APR were modified to support SCTP
streams. APR’s read and send API implementations were modified to read and transmit
data on a specific SCTP stream. Each time APR reads an HTTP request, the SCTP
input stream number is stored in the corresponding request structure, so that the
response can be written on the equivalent SCTP output stream.

Apache  uses  directives  that  allow  a  web  administrator  to  configure  various 
parameters  during  server  initialization.  The  syntax  of  the  Listen  directive  was  modified 
so  that  a  web  admin  can  configure  the  transport  protocol  (TCP  or  SCTP)  during 
initialization. 

2.5 Implementation in the Firefox Web Browser

We chose the Firefox (version 1.6a1) browser since it is a widely used
open-source browser. Firefox belongs to the Mozilla suite of applications, which have a
layered architecture [Mozilla]. Mozilla applications such as Firefox and Thunderbird
(mail/news reader) belong to the top layer, and rely on the services layer for access to
network and file I/O. The services layer uses platform independent APIs offered by the
Netscape Portable Runtime (NSPR) library.

Firefox has a multi-threaded architecture. To render a web page, the HTTP
module  in  the  services  layer  parses  the  URL,  uses  NSPR  to  open  a  TCP  connection  to 
the  appropriate  web  server,  and  downloads  the  web  page.  While  parsing  the  web  page, 
the  HTTP  module  opens  additional  TCP  connections  as  required,  and  pipelines  HTTP 
GET  requests  for  the  embedded  objects. 

Adapting  Firefox  to  work  over  SCTP  streams  involved  modifications  to 
both  NSPR  and  the  HTTP  module. 

2.5.1  Adapting  NSPR 

We first modified NSPR to create and set up an SCTP socket instead of a
TCP  socket.  During  association  establishment,  NSPR  requests  a  specific  number  of 
SCTP  input  and  output  streams.  Note  that  this  request  can  be  negotiated  down  by  the 
server.  Therefore,  after  association  establishment,  NSPR  queries  SCTP  for  the  number 
of  input/output  streams  available  for  HTTP  transactions.  Also,  NSPR  was  modified  to 
include  new  SCTP  related  send  and  receive  methods. 

In  the  current  implementation,  HTTP  request  scheduling  over  SCTP 
streams  is  handled  within  NSPR.  Since  the  HTTP  module  is  more  knowledgeable  about 
the  web  page  contents  and  user  preferences,  future  implementations  could  consider 
HTTP  request  scheduling  at  the  HTTP  module. 

In current HTTP request scheduling, the requests are transmitted in a
round-robin fashion over SCTP streams. Other scheduling policies can also be
considered. For example, in a lossy network environment, such as wide area wireless
connectivity through GPRS, a better scheduling policy might be ‘smallest pending
object first’ where the next GET request goes on the SCTP stream that has the smallest
sum of object sizes pending transfer. Such a policy reduces the probability of HOL
blocking among the responses downloaded on the same SCTP stream.
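
A minimal Python sketch of the ‘smallest pending object first’ idea follows. It rests on assumptions that are not part of the implementation described here: the client is assumed to know (or estimate) each object's size before requesting it, and the sketch never decrements the pending totals as transfers complete. The function names are hypothetical.

```python
def pick_stream(pending_bytes):
    """Return the stream with the smallest sum of object sizes pending transfer.

    pending_bytes maps stream number -> bytes still awaiting transfer.
    Ties are broken in favor of the lowest stream number.
    """
    return min(sorted(pending_bytes), key=lambda sid: pending_bytes[sid])


def schedule_smallest_pending_first(object_sizes, num_streams):
    """Assign objects (in request order) to streams; returns one stream number per object."""
    pending = {sid: 0 for sid in range(num_streams)}
    assignment = []
    for size in object_sizes:
        sid = pick_stream(pending)
        pending[sid] += size  # assumed size estimate, see lead-in
        assignment.append(sid)
    return assignment
```

With two streams and sizes of 100, 1, 1, and 1 units, the large object occupies stream 0 while the three small objects share stream 1, so none of the small responses queues behind the large one.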

2.5.2  Adapting  the  HTTP  Module 

Modifying the HTTP module turned out to be more challenging than
expected, primarily due to Firefox’s assumptions about in-order data delivery within a
transport connection. Within the HTTP module, an nsHttpPipeline object is responsible
for sending pipelined requests and reading pipelined responses. As shown in Figure 2.4,
nsHttpPipeline creates an nsHttpTransaction object for each request. An
nsHttpTransaction object is associated with an nsHttpConnection object, which reads
the HTTP responses from NSPR. Since pipelined responses are read back-to-back,
nsHttpPipeline uses the response length information (available in the response header)
to distinguish the end of the current response from the beginning of the next response. In
effect, an nsHttpPipeline object assumes the following about a transport layer
connection:

1. All pieces of one response will be delivered before any piece of another
response is delivered. That is, pieces of responses will not be delivered in an
interspersed fashion.

2. Responses are delivered in the same sequence in which the pipelined
requests were transmitted.

These assumptions hold when the underlying transport is TCP, a reliable
protocol delivering in-order data to nsHttpPipeline. However, various factors result in
out-of-order response delivery in HTTP over SCTP streams.

Figure 2.4: Modifications to Firefox HTTP Module


2.5.2.1  Factors  Affecting  Response  Delivery  in  HTTP  over  SCTP  streams 

As mentioned in Section 2.4.1, the current Apache implementation reads
and processes requests in succession (one after the other) within a transport
connection. Therefore, Apache generates responses in the same sequence that it reads
requests, i.e., Apache’s response sequence (server_response) equals its request
sequence (server_request). Also, for the following discussions, let the HTTP module’s
transmitted request sequence be client_request, and the delivered response sequence be
client_response.

2.5.2.1.1 Non HOL Blocked Requests

Loss of HTTP requests transmitted on stream i does not prevent delivery
of successfully received requests on stream j. During request losses, server_request will
be different from client_request. Therefore, the generated server_response and
client_response will be different from client_request, violating nsHttpPipeline’s
assumption (2).

2.5.2.1.2  Non  HOL  Blocked  Responses 

At Firefox’s SCTP layer, the loss of a response on stream i does not
prevent delivery of successfully received responses on stream j. During response losses,
client_response can be different from client_request, also violating nsHttpPipeline’s
assumption (2).

2.5.2.1.3  Interaction  between  Apache  and  FreeBSD  SCTP 

SCTP preserves message boundaries. At Apache, data in each write()
translates to an application message, and this message is delivered in its entirety to the
receiving application. SCTP fragments a message into Path MTU (PMTU) sized
TPDUs before transmission. SCTP’s fragmentation and reassembly process is designed
such that all message fragments must be assigned consecutive Transmission Sequence
Numbers (TSNs). Therefore, all message fragments must be transmitted sequentially.
The receiving SCTP uses the (i) (B)eginning fragment bit, (ii) sequential TSNs, and (iii)
(E)nding fragment bit for correct reassembly [RFC4960]. In effect, SCTP’s
fragmentation and reassembly create dependencies in message transmission. A
fragment of message i+1 cannot be transmitted until all fragments of message i have
been transmitted.

Apache’s request processing rate is often higher than SCTP’s data
transmission rate, especially when SCTP’s data transmission is limited by low
bandwidth/high latency links and/or packet losses. In such scenarios, as long as the
SCTP socket’s send buffer allows, Apache writes multiple HTTP responses on the
socket, and these responses await transmission at the SCTP send buffer. If Apache
writes a 100K response on stream i followed by a 1K response on stream j, SCTP will
not transmit the 1K response until all fragments of the 100K response are successfully
transmitted. Note that the transmission time of the 100K response increases in low
bandwidth/high latency/high loss scenarios. Since the 100K and 1K responses are
independent of each other, it is highly desirable that the browser’s rendering of the 1K
response does not depend on transmission/arrival/rendering of the 100K response.

To overcome this issue, we moved message fragmentation from the
SCTP layer to Apache, in the form of HTTP response fragmentation. Apache writes an
HTTP response as multiple application messages, such that each message at the SCTP
layer results in a PMTU-sized TPDU, and is not fragmented further by SCTP. An application
can use either the SCTP_PEER_ADDR or the SCTP_STATUS socket options to get
the association’s PMTU [Stewart 2008b].

HTTP response fragmentation results in the following interesting
interaction between Apache and FreeBSD SCTP. The FreeBSD SCTP maintains a
queue of application messages for each outbound stream in an association. Note that
during HTTP response fragmentation, each message in these queues corresponds to a piece
of an HTTP response. The FreeBSD SCTP transmits messages from the stream queues
in a round-robin fashion. If an SCTP association has m outbound streams, once an
application message from stream i’s queue is transmitted, a message from stream ((i+1)
mod m)’s queue is considered for transmission. When Apache’s request processing rate
is higher than SCTP’s transmission rate, multiple SCTP stream queues contain
messages (pieces of HTTP responses) awaiting transmission. Due to FreeBSD SCTP’s
round-robin transmission, the HTTP response pieces are transmitted in an interspersed
fashion, and arrive in the same fashion at Firefox’s SCTP layer. In fact, even under no
loss conditions, delivery of a piece of response i can be followed by a piece of response
j, violating nsHttpPipeline’s assumption (1).
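
The interaction between response fragmentation and round-robin stream queues can be illustrated with a small Python simulation. This is only a sketch of the behavior described above, not FreeBSD or Apache code: the function names are hypothetical, and the 1400-byte piece size is an assumed stand-in for the usable PMTU payload.

```python
from collections import deque


def fragment(response_id, length, piece_size):
    """Split a response of `length` bytes into pieces of at most `piece_size` bytes.

    Each piece is a tuple (response_id, offset, size).
    """
    pieces = []
    offset = 0
    while offset < length:
        size = min(piece_size, length - offset)
        pieces.append((response_id, offset, size))
        offset += size
    return pieces


def round_robin_transmit(stream_queues):
    """Drain per-stream queues one message at a time, visiting streams in round-robin order."""
    queues = [deque(q) for q in stream_queues]
    order = []
    while any(queues):  # an empty deque is falsy
        for q in queues:
            if q:
                order.append(q.popleft())
    return order


if __name__ == "__main__":
    piece_size = 1400  # assumed payload per PMTU-sized TPDU
    stream0 = fragment("A", 3000, piece_size)  # larger response on stream 0
    stream1 = fragment("B", 1000, piece_size)  # small response on stream 1
    for piece in round_robin_transmit([stream0, stream1]):
        print(piece)
```

In this example the single piece of response B is sent after the first piece of A rather than after all of A, which is the interspersed delivery that breaks assumption (1).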

2.5.2.1.4  Web  Server  Architecture 

Currently,  Apache’s  multi-threaded  architecture  dedicates  a  server  thread 
to  each  transport  connection,  and  the  server  thread  services  requests  in  succession.  We 
envision  a  multi-threaded  server  architecture,  where  multiple  server  threads 
concurrently  serve  requests  on  a  transport  connection  [Natarajan  2006a].  To 
understand  our  motivation  for  the  new  architecture,  consider  the  following  two  cases: 
(i)  current  architecture,  where  a  single  server  thread  serves  responses  1  and  2  in 
succession,  and  (ii)  new  architecture,  where  two  server  threads  concurrently  serve 
responses  1  and  2.  Note  that  the  server  communicates  over  a  single  SCTP  association 
in  both  cases.  However,  the  concurrency  in  case  (ii)  causes  the  initial  pieces  of  both 
responses  to  be  transmitted  sooner  (and  rendered  sooner  by  the  client)  than  case  (i). 
We  call  case  (ii)  object  interleaving  and  discuss  its  advantages  in  [Natarajan  2006a]. 

Now, assume that the web server does HTTP response fragmentation and
both responses are transmitted on the same SCTP stream. In case (i), the server writes
all pieces of response 1 on the stream before writing response 2. Therefore, all pieces
of response 1 are transmitted (and delivered) to Firefox before any piece of response 2.
However, in case (ii), the two server threads write concurrently over the same SCTP
stream. Therefore, the response pieces can be transmitted and delivered in an
interspersed fashion at Firefox, violating nsHttpPipeline’s assumption (1).

2.5.2.2  Modifications  to  the  HTTP  Module 

Based on our experience with Apache and Firefox, we believe that adapting
Apache and Firefox to handle object interleaving is a complex task, and it might be
easier to develop a new server and browser from scratch. Nevertheless, we reiterate
that a multistreamed web transport opens up new possibilities such as object
interleaving, which can further improve HTTP performance.

In our Firefox adaptation over SCTP, nsHttpPipeline’s assumptions on
response delivery are the same as before, but they now apply to an
SCTP stream instead of a transport connection. nsHttpPipeline assumes that, within an
SCTP stream, (1) all pieces of one response will be delivered in-order, before any piece
of another response is delivered, and (2) responses are delivered in the same sequence
in which the pipelined requests were transmitted.

The  HTTP  module  was  modified  as  follows  (Figure  2.4): 

• nsHttpConnection maintains a table data structure as shown in Figure 2.4. Each
entry in the table is a pair of {SCTP stream number, queue of requests
(nsHttpTransaction objects) transmitted over the stream}.

•  After  transmitting  a  request  over  an  SCTP  output  stream,  nsHttpConnection 
appends  the  corresponding  nsHttpTransaction  object  to  the  tail  of  the  stream’s 
queue. 

• Whenever data can be read from the SCTP socket, NSPR first notifies
nsHttpConnection about the SCTP input stream number. NSPR uses the
MSG_PEEK flag and/or SCTP’s extended receive information structure
[Stewart 2008b] to gather this information.

• Once nsHttpConnection knows the SCTP input stream, nsHttpConnection
associates the received piece of response to the nsHttpTransaction at the head
of the stream’s queue.

• When the nsHttpTransaction object is read completely, nsHttpConnection
deletes this transaction from the head of the stream queue, so that the next
piece of response on the stream is delivered to the new head of queue.
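
The per-stream bookkeeping listed above can be sketched in Python as follows. The classes Transaction and StreamDemux are hypothetical stand-ins for nsHttpTransaction and the table kept by nsHttpConnection; they are not Firefox code. As a simplification, the expected response length is supplied up front, whereas the real module learns it from the response header.

```python
from collections import defaultdict, deque


class Transaction:
    """Stand-in for an nsHttpTransaction: collects response bytes until complete."""

    def __init__(self, url, expected_length):
        self.url = url
        self.expected_length = expected_length  # simplification, see lead-in
        self.body = bytearray()

    def feed(self, piece):
        self.body.extend(piece)

    def complete(self):
        return len(self.body) >= self.expected_length


class StreamDemux:
    """Table of {SCTP stream number -> queue of transactions sent on that stream}."""

    def __init__(self):
        self.table = defaultdict(deque)
        self.finished = []

    def request_sent(self, stream, transaction):
        # Append to the tail of the stream's queue after transmitting the request.
        self.table[stream].append(transaction)

    def data_received(self, stream, piece):
        # Deliver the piece to the transaction at the head of the stream's queue.
        queue = self.table[stream]
        if not queue:
            raise RuntimeError(f"no pending transaction on stream {stream}")
        head = queue[0]
        head.feed(piece)
        if head.complete():
            # Remove the finished transaction so the next piece goes to the new head.
            self.finished.append(queue.popleft())
```

Pieces arriving on different streams can be interleaved freely; only the order within a stream matters, which is exactly the per-stream form of assumptions (1) and (2).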

2.6  Evaluation  Preliminaries 

The SCTP-enabled Apache and Firefox were used to evaluate
improvements to web users’ browsing experience in Internet conditions found in the
developing world. This section discusses evaluation preliminaries such as the nature of
web workloads and experimental setup.

2.6.1  Nature  of  Web  Workloads 

Several web characterization studies have identified certain key properties
of the WWW. These properties have led to a better understanding of WWW’s nature,
and the design of more efficient algorithms for improved WWW performance.

Using server logs from six different web sites, Arlitt et al. identified
several key web server workload attributes that were common across all six servers
[Arlitt 1997]. Their work also predicted that these attributes would likely “persist over
time”. Of these attributes, the following are most relevant to our study: (i) both file size
and transferred file size distributions are heavy-tailed (Pareto), and (ii) the median
transferred file size is small (<5KB). A similar study conducted several years later
confirmed that the above two attributes remained unchanged over time [Williams
2005]. Also, [Williams 2005] found that the mean transferred file size had slightly
increased over the years, due to an increase in the size of a few large files. Other
studies such as [Houtzager 2003, Williamson 2003] agree with the findings of [Arlitt 1997]
regarding transferred file size distribution and median transferred file size.

These measurement studies lead to a consensus that, unlike bulk file or
multimedia transfers, HTTP transfers are short-lived flows, where a typical web object
consists of a small number of TPDUs and can be transferred in a few RTTs.

2.6.2  Experimental  Setup 

The  emulations  were  performed  on  the  FreeBSD  platform  which  had  the 
kernel-space  reference  SCTP  implementation.  The  experimental  setup,  shown  in  Figure 
2.5  uses  three  nodes  running  FreeBSD  6.1:  (i)  a  node  running  the  in-house  TCP  or 
SCTP  HTTP  1.1  client,  (ii)  a  server  running  Apache,  and  (iii)  a  node  running 
Dummynet  [Rizzo  1997]  connecting  the  server  and  client.  Dummynet’s  traffic  shaper 
configures  a  full-duplex  link  between  client  and  server,  with  a  queue  size  of  50  packets 
in  each  direction.  Both  forward  and  reverse  paths  experience  Bernoulli  losses  with  loss 
rates  varying  from  0%-10%  —  typical  of  the  end-to-end  loss  rates  observed  in 
developing  regions  [Cottrell  2006,  PingER]. 

FreeBSD TCP’s default initial cwnd is 4MSS [FreeBSD, RFC3390]. The
recommended initial cwnd in SCTP is 4MSS as well. FreeBSD TCP implements packet
counting, while SCTP implements Appropriate Byte Counting (ABC) with L=1
[RFC4960, RFC3465]. Additionally, FreeBSD TCP implements Limited Transmit
[RFC3042], which enhances loss recoveries for flows with small cwnds. Both
transports implement SACKs and delayed acks.

2.7 Single TCP Connection vs. Single Multistreamed SCTP Association

This  section  compares  an  HTTP  1.1  persistent,  pipelined  transfer  over  a 
single  TCP  connection  vs.  over  a  single  multistreamed  SCTP  association.  The  impact 
of  multiple  transport  connections  is  discussed  in  Section  2.8. 

2.7.1  Experiment  Parameters 

Every pipelined transfer comprises an index.html with N equal sized
embedded objects of the following sizes: 3KB, 5KB, 10KB, and 15KB. The number of
embedded objects (N) varies: 5, 10, and 15. We believe these values reflect current
trends in web pages. For example, the number of embedded images in web pages of
online services such as maps.google.com and flickr.com varies from 8 to 20. At both
client and server nodes, we assume that the transport layer send and receive buffers are
not the bottlenecks; they are large enough to hold all data of a pipelined transfer.

The following high latency browsing environments are considered for
evaluation [Cottrell 2006, PingER]. Results for other high latency environments such
as High Speed Downlink Packet Access (HSDPA) links are available in [Natarajan
2007].

• 1Mbps link with 350ms RTT (1Mbps.350ms): User in South Asia, accessing a
web server in North America over a land line.

• 1Mbps link with 850ms RTT (1Mbps.850ms): User in Africa, sharing a VSAT
link to access a web server in North America.

• 1Mbps link with 1100ms RTT (1Mbps.1100ms): User in Africa, sharing a
VSAT link to access a web server within Africa. The web traffic traverses at
least 2 VSAT links; the RTT over each VSAT link is ~550ms.

2.7.2  Results:  Page  Rendering  Times 

A  web  page  is  considered  completely  downloaded  when  Firefox  receives 
the  last  piece  of  pipelined  transfer  from  the  transport  layer  (Figure  2.1).  The  web  page 
is  completely  rendered  when  Firefox  processes  and  draws  this  last  piece  on  the  user’s 
screen.  In  HTTP  over  TCP  (HTTP/TCP),  the  last  piece  of  data  always  belongs  to  the 
last  pipelined  object,  whereas  in  HTTP  over  SCTP  streams  (HTTP/SCTP),  the  last 
piece  of  data  could  belong  to  any  pipelined  object.  In  both  schemes,  rendering  the  last 
piece  of  an  object  depends  on  the  throughput  of  the  underlying  transport  connection. 

Using terminology defined in Section 2.2 (see Figure 2.1), page rendering
time is defined as the time from when the browser sends the first GET request
(index.html), to the time when the last piece of the web page is painted on the screen.
Page rendering time (T) = (ren^ - req^)

Figure 2.6: Page Rendering Times (N=10)

Our initial hypotheses about SCTP and TCP’s page rendering times were
as follows:

• Both SCTP and TCP have similar values for their initial cwnd, and employ
delayed acks with a 200ms timer. Therefore, we expected both TCP and
SCTP’s page rendering times to be identical when no losses occur.

• Though SCTP and TCP congestion control are similar, minor differences enable
better loss recovery and increased throughput in SCTP [Alamgir 2002]. Unlike
TCP whose SACK info is limited by the space available for TCP options, the
size of SCTP’s SACK chunk is larger (only limited by the path MTU), and
therefore at times contains more information about lost TPDUs than TCP’s
SACK. Also, FreeBSD’s SCTP stack implements the Multiple Fast Retransmit
algorithm (MFR), which reduces the number of timeout recoveries at the sender
[Caro 2006]. Therefore, as loss rates increase, we expected the enhanced loss
recovery features to help SCTP outperform TCP.

Figure 2.6 shows the page rendering times for N=10, averaged over 50
runs with 95% confidence. Similar results for N=5 and 15 can be found in [Natarajan
2007]. Interestingly, in all 3 graphs, the results for the no loss case contradict our
first hypothesis, and TCP’s rendering times are slightly (but not perceivably) better than
SCTP’s. Detailed investigation revealed the following difference between the FreeBSD 6.1
SCTP and TCP implementations. SCTP implements Appropriate Byte Counting (ABC) with L=1.
During slow start, a sender increments cwnd by 1MSS bytes for each delayed ack. The
TCP stack does packet counting which results in a more aggressive cwnd increase
when the client acks TCP PDUs smaller than 1MSS (such as HTTP response headers).
We expect SCTP to perform similar to TCP when the TCP stack implements ABC with
L=1.
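
The difference between the two cwnd update rules can be made concrete with a simplified Python sketch. It assumes the textbook slow-start rules (packet counting adds one MSS per ack; byte counting adds the bytes acked, capped at L times MSS) and ignores ssthresh, losses, and everything else a real stack does. The function name and the numbers in the example are hypothetical.

```python
def cwnd_after_acks(initial_cwnd, acked_bytes_per_ack, mss, byte_counting, limit=1):
    """Slow-start cwnd (in bytes) after processing a sequence of acks.

    byte_counting=True  -> ABC: cwnd += min(bytes acked, limit * MSS) per ack.
    byte_counting=False -> packet counting: cwnd += MSS per ack.
    """
    cwnd = initial_cwnd
    for acked in acked_bytes_per_ack:
        if byte_counting:
            cwnd += min(acked, limit * mss)
        else:
            cwnd += mss
    return cwnd


if __name__ == "__main__":
    mss = 1460
    # Hypothetical ack pattern: one small PDU (e.g. a response header) and
    # two delayed acks that each cover two full-sized PDUs.
    acks = [300, 2 * mss, 2 * mss]
    print("ABC (L=1):      ", cwnd_after_acks(4 * mss, acks, mss, byte_counting=True))
    print("packet counting:", cwnd_after_acks(4 * mss, acks, mss, byte_counting=False))
```

For this ack pattern the byte-counting sender ends at 9060 bytes and the packet-counting sender at 10220 bytes; the gap comes entirely from the ack of the PDU smaller than 1MSS.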

As the loss rate increases, SCTP’s enhanced loss recovery offsets the
difference in SCTP vs. TCP cwnd evolution. SCTP begins to perform better; the
difference is even more pronounced for transfers containing larger objects (10K and
15K). For the 1Mbps.1100ms case, the difference between SCTP and TCP page
rendering times for 10K and 15K transfers is ~6 seconds at 3% loss, and as high as ~15
seconds at 10% loss. For the same types of transfers, the difference is ~8-10 seconds
for 10% loss in the 1Mbps.350ms scenario. Similar trends are observed in results for N=5
and 15 as well [Natarajan 2007].

To  summarize,  SCTP’s  page  rendering  times  are  comparable  to  TCP’s 
during  no  loss,  and  SCTP’s  enhanced  loss  recovery  enables  faster  page  rendering  times 
during  lossy  conditions.  More  importantly,  the  absolute  page  rendering  time  difference 
increases,  and  is  more  visually  perceivable  as  the  end-to-end  delay,  loss  rate,  and 
pipelined  transfer  size  increase. 

2.7.3  Results:  Response  Times  for  Pipelined  Objects 

Persistent and pipelined HTTP 1.1 transfers over a single TCP connection
result in sequential rendering at Firefox: even if Firefox’s TCP layer has downloaded
all objects in the pipelined transfer, these independent objects are delivered to Firefox
only in a sequential manner, such that Firefox processes and renders at most one object
at a time. Packet losses cause HOL blocking and further delay the sequential delivery of
independent objects. On the other hand, SCTP streams provide concurrency in the
transfer and delivery of independent objects: an SCTP receiver can deliver object i+1
to Firefox even before object i is completely delivered as long as these two objects are
transmitted over different SCTP streams. This concurrency enables Firefox to render
multiple objects in parallel, a.k.a., concurrent rendering.


While  browsers  have  to  open  multiple  TCP  connections  to  achieve 
concurrent  rendering,  concurrent  rendering  is  innate  to  a  multistreamed  web  transport. 
The  browser  tunes  the  concurrency  level  by  simply  adjusting  the  number  of  streams.  An 
SCTP  association  with  one  stream  provides  the  same  concurrency  as  a  single  TCP 
connection,  and  results  in  sequential  rendering.  An  SCTP  association  with  two  streams 
provides  twice  as  much  concurrency  as  sequential  rendering.  A  multistreamed 
association  provides  maximum  concurrency  for  a  pipelined  transfer  when  the  number 
of  streams  equals  the  number  of  objects  in  the  transfer.  Note  that  concurrent  rendering 
remains  unaffected  by  a  further  increase  in  concurrency. 

In our initial investigations, we discovered that a multistreamed web
transport enables concurrent rendering even during no losses. Irrespective of packet
losses, the interaction between Apache’s HTTP response fragmentation and FreeBSD
SCTP (Section 2.5.2.1.3) causes Firefox’s SCTP layer to receive pieces of multiple
objects in an interleaved fashion. The SCTP receiver delivers these pieces of multiple
objects in an interspersed fashion to Firefox, resulting in concurrent rendering even
during no losses. During packet losses, SCTP streams eliminate or reduce HOL
blocking, thus increasing the degree of concurrent rendering. Concurrent rendering is
demonstrated in a number of movies available online at [Movies].

To  reiterate,  the  fundamental  difference  between  sequential  and  concurrent 
rendering  is  that  in  sequential  rendering,  a  piece  of  object  i  is  rendered  only  after 
objects  1  through  i-1  are  completely  rendered,  whereas  in  concurrent  rendering, 
pipelined objects are displayed independently of each other. We use the following metric
to capture the concurrency and progression in the appearance of all pipelined objects
on the user’s screen. Recall the terminology from Section 2.2:


$req_0$ = time when the browser sends the HTTP GET request for index.html.

($pren_i - req_0$) = time elapsed from the beginning of the page download
($req_0$) to the earliest time when at least P% of object i is rendered.

pPage is defined as the time elapsed from the beginning of page download
to the earliest time when at least P% of all pipelined objects are rendered on the screen,
i.e., $pPage = \max\{(pren_i - req_0) : 1 \le i \le N\}$
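
The pPage definition can be sketched in a few lines of Python. This is only an illustration of the metric, not the measurement code used in the experiments: the input format (per-object lists of render-progress samples) and the toy numbers are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed input format: for each pipelined object i, a list of
# (seconds since req_0, percent of the object rendered) samples,
# sorted by increasing time.
Timeline = List[Tuple[float, float]]


def p_ren(timeline: Timeline, p: float) -> float:
    """Earliest time (relative to req_0) at which at least p% of one object is rendered."""
    for t, percent in timeline:
        if percent >= p:
            return t
    raise ValueError("object never reached the requested percentage")


def p_page(timelines: Dict[int, Timeline], p: float) -> float:
    """pPage: earliest time at which at least p% of every pipelined object is rendered."""
    return max(p_ren(timeline, p) for timeline in timelines.values())


# Toy data for N=2 objects (illustrative numbers, not measurements).
sequential = {
    1: [(1.0, 50.0), (2.0, 100.0)],
    2: [(3.0, 50.0), (4.0, 100.0)],
}
concurrent = {
    1: [(1.0, 25.0), (2.0, 50.0), (4.0, 100.0)],
    2: [(1.5, 25.0), (2.5, 50.0), (4.0, 100.0)],
}
```

In the toy data both schemes finish at the same time, so their 100%Page values match, while the concurrent timeline reaches 50% of every object earlier.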

Figure 2.7 plots the 25%Page, 50%Page, 75%Page and 100%Page values for
N=10, averaged over 50 runs. Transfers over SCTP consider maximum concurrency,
i.e., enough SCTP streams are opened so that every pipelined object is downloaded on
a different stream. Results for N=5 and 15 can be found in [Natarajan 2007]. As
expected, 100%Page values for both concurrent (solid points connected by dotted lines)
and sequential (hollow points connected by dashed lines) rendering equal the
corresponding transport’s page rendering times (T). Also, the pPage times in
concurrent  rendering  are  spread  out  vs.  clustered  together  in  sequential  rendering. 
Concurrent  rendering’s  dispersion  in  pPage  values  signifies  the  parallelism  in  the 
appearance of all 10 pipelined objects.

Figure 2.7: pPage Values for N=10. Panels (a) 1Mbps.350ms, (b) 1Mbps.850ms, (c) 1Mbps.1100ms; x-axis: Pipelined Transfer Type; legend: sequential and concurrent rendering at 25%, 50%, 75% and 100%

Both sequential and concurrent rendering schemes’ values are comparable
at 0% loss. As the loss rate increases, the difference between the two rendering schemes’ pPage
values increases. In addition, we find that concurrent rendering displays 25%-50% of all
pipelined objects much sooner (relative difference ~4 - 2 times for 15K, 10K and 5K
objects) than sequential rendering. This result holds true for N=5 and 15 as well. In the
following subsection, we demonstrate how this result can be leveraged to significantly
improve response times for objects such as progressive images, whose initial 25%-50%
contain sufficient information for the human eye to perceive the object contents.

2.7.4  Concurrent  Rendering  and  Progressive  Images 

Progressive images (e.g., JPEG, PNG) are coded such that the initial
TPDUs approximate the entire image, and successive TPDUs gradually improve the
image’s quality/resolution. Via simple experiments, we demonstrate how concurrent
rendering considerably improves user perception of progressive images. The example
web page consists of an initial 1K image of our lab’s logo, followed by 10 progressive
JPEG images of world leaders, each of size 10K.

Both Firefox over TCP (sequential) and Firefox over SCTP (concurrent)
download the example web page over a 56Kbps link with 1080ms RTT. The full page
downloads were captured as movies, and are available online at [Movies]. In the
snapshots shown in Figure 2.8, both sequential (left) and concurrent (right) runs
experienced ~4.3% loss. Both rendering schemes start the download at t=0s. At t=6s
(Figure 2.8a), the sequential scheme rendered a complete image followed by a good
quality 2nd image, and the concurrent scheme displayed a complete image on the
browser window.

Figure 2.8: Concurrent Rendering of Progressive Images (56Kbps.1080ms; 4.3% loss). Panels (a) t=6 seconds, (b) t=7 seconds, (c) t=12 seconds

At t=7s (Figure 2.8b), sequential rendering displays 2 complete images, vs.
concurrent rendering’s 7 partial images, at least 4 of which are of good quality. At
t=12s (Figure 2.8c), sequential rendering displays 4 complete images, whereas
concurrent rendering presents the user with all 10 images of good quality. With
concurrent rendering, the complete page is rendered only at ~t=23s. From t=12s to 23s,
all 10 images get refined, but the value added by the refinement is negligible to the
human eye. Therefore, the user “perceives” all images to be complete by t=12s, while
the page rendering time is actually t=23s. In the sequential run, all 10 images do not
appear on the screen until t=26s.

2.7.5  SCTP  Implementation  and  Concurrent  Rendering 

As  mentioned  earlier,  our  primary  reason  for  choosing  the  FreeBSD 
platform  is  the  availability  of  the  SCTP  reference  implementation  on  FreeBSD.  Section 
2.5.2.1.3 discussed the unique interaction between Apache server and FreeBSD
SCTP’s  round-robin  scheduling  of  application  messages  over  stream  send  queues.  This 
interaction  enabled  concurrent  rendering  even  during  no  packet  losses,  and  increased 
the  degree  of  concurrent  rendering  during  lossy  conditions.  Consequently,  absence  of 
this  interaction  may  lower  the  degree  of  concurrent  rendering.  For  example,  on 
platforms  where  SCTP  implementations  do  FIFO  or  some  other  scheduling  of 
application  messages,  concurrent  rendering’s  pPage  values  may  not  be  as  dispersed  as 
shown  in  Figure  2.7,  but  will  be  more  dispersed  than  the  corresponding  values  for 
sequential  rendering. 
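
The effect of the sender-side scheduling policy on interleaving can be illustrated with a toy model. The sketch below is not FreeBSD's scheduler: it assumes that all message pieces are already queued before transmission begins, and the stream ids and piece labels are made up for the example.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Each entry is (stream id, piece label), listed in the order the
# application handed the pieces to the transport (hypothetical data).
Submitted = List[Tuple[int, str]]


def fifo_order(submitted: Submitted) -> List[str]:
    """FIFO scheduling: pieces leave in exactly the order they were submitted."""
    return [piece for _, piece in submitted]


def round_robin_order(submitted: Submitted) -> List[str]:
    """Round-robin scheduling: one piece per non-empty stream queue, in turn."""
    queues: Dict[int, Deque[str]] = {}
    for stream_id, piece in submitted:
        queues.setdefault(stream_id, deque()).append(piece)
    order: List[str] = []
    while queues:
        for stream_id in list(queues):  # snapshot, since queues shrink below
            order.append(queues[stream_id].popleft())
            if not queues[stream_id]:
                del queues[stream_id]
    return order


submitted = [
    (1, "obj1-a"), (1, "obj1-b"),
    (2, "obj2-a"), (2, "obj2-b"),
    (3, "obj3-a"),
]
```

With this input, round-robin emits the first piece of every object before any second piece, whereas FIFO emits each object's pieces back-to-back. That ordering difference is why the dispersion of pPage values may shrink under FIFO scheduling.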

2.8  Multiple  TCP  Connections  vs.  Single  Multistreamed  SCTP  Association 

The current workaround to reduce HOL blocking and improve an end
user’s perceived WWW performance is to download an HTTP transfer over multiple
TCP connections. This section compares the two approaches proposed to improve
HTTP performance: multiple TCP connections vs. a single multistreamed SCTP
association. Similar to Section 2.7, investigations here focus on browsing conditions
most likely to exist in the developing world.

2.8.1  Background 

In  congestion-controlled  transports  such  as  TCP  and  SCTP,  the  amount  of 
outstanding  (unacknowledged)  data  is  limited  by  the  data  sender’s  cwnd.  Immediately 
after  connection  establishment,  the  sender  can  transmit  up  to  initial  cwnd  bytes  of 
application  data  [RFC3390,  RFC4960].  Until  congestion  detection,  both  TCP  and 
SCTP  employ  the  slow  start  algorithm  that  doubles  the  cwnd  every  RTT. 
Consequently,  the  higher  the  initial  cwnd,  the  faster  the  cwnd  growth  and  more  data 
gets  transmitted  every  RTT.  When  an  application  employs  N  TCP  connections,  during 
the slow start phase, the connections’ aggregate initial cwnd and their cwnd growth
increase N-fold. Therefore, until congestion detection, an application employing N
TCP  connections  can,  in  theory,  experience  up  to  N  times  more  throughput  than  an 
application  using  a  single  TCP  connection. 
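
The N-fold claim can be written out under idealized assumptions: all N connections start at the same time with the same initial window IW, the cwnd doubles exactly once per RTT, and neither loss nor the receiver window limits the sender. The symbols $W_1$, $W_N$ and k (number of RTTs elapsed) are introduced here only for illustration.

```latex
W_1(k) = IW \cdot 2^{k}, \qquad
W_N(k) = N \cdot IW \cdot 2^{k}, \qquad
\frac{W_N(k)}{W_1(k)} = N
% Example with IW = 4 MSS and N = 4:
%   W_1(k) = 4, 8, 16 MSS   for k = 0, 1, 2
%   W_4(k) = 16, 32, 64 MSS for k = 0, 1, 2
```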

When  a  TCP  or  SCTP  sender  detects  packet  loss,  the  sender  halves  the 
cwnd,  and  enters  the  congestion  avoidance  phase  [Jacobson  1988,  RFC4960].  If  an 
application  employing  N  TCP  connections  experiences  congestion  on  the  transmission 
path,  not  all  of  the  connections  may  suffer  loss.  If  M  of  the  N  open  TCP  connections 
suffer loss, the multiplicative decrease factor for the connection aggregate is (1 - M/2N)
[Balakrishnan  1998a].  If  this  decrease  factor  is  greater  than  one-half  (which  is  the  case 
unless  all  N  connections  experience  loss,  i.e.,  M<N),  the  connections’  aggregate  cwnd 
and  throughput  increase  after  congestion  detection  is  more  than  N  times  that  of  a  single 
TCP connection.

On the whole, an application employing multiple TCP senders exhibits an
aggressive sending rate, and consumes a higher share of the bottleneck bandwidth than
an application using fewer or single TCP connection(s) [Mahdavi 1997, Balakrishnan
1998a]. Multiple TCP connections’ aggressive sending behavior has been shown to
increase throughput for various applications so far. [Tullimas 2008] employs multiple
TCP connections to maintain the data streaming rate in multimedia applications.
[Sivakumar 2000] proposes the PSockets library, which employs parallel TCP
connections to increase throughput for data intensive computing applications.
Likewise, we expect multiple TCP connections to improve HTTP throughput.
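
The aggregate decrease factor quoted earlier in this subsection from [Balakrishnan 1998a] can be recovered with a short calculation. It assumes, for illustration only, that all N connections hold the same cwnd w at the moment of congestion, so the aggregate is $W = Nw$, and that exactly M of them halve their cwnd.

```latex
\begin{align*}
W  &= N w \\
W' &= (N - M)\,w + M\,\frac{w}{2}
    = N w - \frac{M w}{2}
    = W\left(1 - \frac{M}{2N}\right)
\end{align*}
% M = N gives W'/W = 1/2, the same reduction as a single connection.
% N = 4, M = 1 gives W'/W = 7/8, a much gentler reduction.
```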

2.8.2 In-house HTTP 1.1 Client

The original plan was to use the Apache web server and the Firefox
browser for the evaluations. But, following initial investigations, we decided to employ
a custom built HTTP 1.1 client instead of Firefox for the following reason.

In Firefox, the number of open transport connections to a server/proxy can
be easily modified via user configuration. Firefox parses a URL, opens the first
transport connection to the appropriate web server, and retrieves index.html. After
parsing index.html, Firefox opens the remaining connection(s) to the server, and
pipelines further requests across all connection(s) in a round-robin fashion. Initial
investigations revealed that Firefox delays pipelining requests on a new transport
connection. Specifically, the first HTTP transaction on a transport connection is always
non-pipelined. After the successful receipt of the first response, subsequent requests on
the same transport connection are then pipelined. We believe this behavior is Firefox’s
means of verifying whether a server supports persistent connections [RFC2616 Section
8]. However, this precautionary behavior increases the per connection transfer time by
at least 1 RTT, and packet losses during the first HTTP transaction further increase the
transfer time. Clearly, this behavior is detrimental to HTTP throughput over multiple
TCP connections. Also, this behavior interferes with the dynamics we are interested in
investigating: the interaction between multiple TCP connections and HTTP performance.
Therefore, we developed a simple HTTP 1.1 client, which better models the general
behavior of HTTP 1.1 over multiple transport connections, and does not bias results
against multiple TCP connections.

The  in-house  client  reproduces  most  of  Firefox’s  transaction  model,  except 
that  this  client  immediately  starts  pipelining  on  each  new  transport  connection.  The 
client employs either TCP or SCTP for the HTTP transfer. Over TCP, one or more
connections are utilized for the HTTP 1.1 transfer. Over SCTP, the complete page is downloaded
using a single multistreamed association with maximum concurrency (each
pipelined transaction is retrieved on a different SCTP stream). Additionally, the client
mimics  all  of  Firefox’s  interactions  with  the  transport  layer  such  as  non-blocking 
reads/writes,  and  disabling  the  Nagle  algorithm  [RFC896].  The  following  algorithm 
describes  the  client  in  detail: 

1.  Setup  a  TCP  or  SCTP  socket. 

2.  If  SCTP,  set  appropriate  data  structures  to  request  the  required  number  of 
input  and  output  streams  during  association  establishment. 

3.  Connect  to  the  server. 

4.  Timestamp  “Page  Download  Start  Time”. 

5.  Request  for  index.html. 

6.  Receive  and  process  index.html. 

7. Make the socket non-blocking, and disable Nagle.

8. While there are more transport connections to be opened:

8.1. Setup a socket (non-blocking, disable Nagle).

8.2. Connect to the server.

9. While the complete page has not been downloaded:

9.1. Poll for read, write, or error events on socket(s).

9.2. Transmit pending requests on TCP connections or SCTP
streams in a round-robin fashion.

9.3. Read response(s) from readable socket(s).

10.  Timestamp  “Page  Download  End  Time”. 
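
Step 9.2 of the algorithm above can be sketched as follows. This is a minimal illustration of round-robin request assignment and pipelining, not the in-house client itself: the function names, the host name `example.test` and the object paths are hypothetical, and socket handling (steps 1 to 9.1 and 9.3) is omitted.

```python
from typing import Dict, List


def build_get(path: str, host: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request for one embedded object."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def assign_round_robin(paths: List[str], n_channels: int) -> Dict[int, List[str]]:
    """Spread pending requests over n_channels TCP connections or SCTP streams."""
    channels: Dict[int, List[str]] = {c: [] for c in range(n_channels)}
    for i, path in enumerate(paths):
        channels[i % n_channels].append(path)
    return channels


def pipelined_payloads(paths: List[str], n_channels: int,
                       host: str = "example.test") -> Dict[int, bytes]:
    """Concatenate each channel's requests so they are sent without waiting
    for earlier responses (pipelining)."""
    assignment = assign_round_robin(paths, n_channels)
    return {c: b"".join(build_get(p, host) for p in assigned)
            for c, assigned in assignment.items()}


# Hypothetical page with 10 embedded objects, as in the experiments.
objects = [f"/obj{i}.jpg" for i in range(10)]
```

With 10 channels every object gets its own connection or stream, which corresponds to 10-TCP or to SCTP with maximum concurrency; with 1 channel all requests are pipelined on a single connection.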

2.8.3  Experiment  Parameters 

The  sample  web  page  used  in  the  emulations  comprises  an  index.html  with 
10  embedded  objects.  All  embedded  objects  are  the  same  size  -  5KB.  The  impact  of 
varying object sizes is discussed in Section 2.8.4.3.

Evaluations  in  Section  2.7  considered  a  1Mbps  last  hop  bandwidth,  which 
is  deemed  to  be  a  costly,  high-end  option  for  an  average  user  in  the  developing  world. 
Therefore,  apart  from  a  1Mbps  last-hop,  the  following  more  limited  last-hop  bandwidths 
found  in  developing  regions  are  considered  [Du  2006]:  64Kbps,  128Kbps,  and 
256Kbps.  Also,  the  following  end-to-end  propagation  delays  are  considered  [Cottrell 
2006,  PingER]: 

•  200ms  RTT:  User  in  East  Asia,  accessing  a  web  server  in  North  America  over  a 
land  line. 

•  350ms  RTT:  User  in  South  Asia,  accessing  a  web  server  in  North  America  over 
a  land  line. 

• 650ms RTT: User accessing a web server over a shared VSAT link.

Figure 2.9: HTTP Throughput (Object Size = 5K)

The  FreeBSD  TCP  implementation  tracks  numerous  sender  and  receiver 
related  statistics  including  the  number  of  timeout  recoveries,  and  fast  retransmits.  After 
each  TCP  run,  some  of  these  statistics  were  gathered  either  directly  from  the  TCP  stack 
or  using  the  netstat  utility. 

2.8.4  Results:  HTTP  Throughput 

The  HTTP  page  download  time  is  measured  as  “Page  Download  End 
Time”  -  “Page  Download  Start  Time”  (Section  2.8.2).  Figure  2.9  shows  the  HTTP 
page  download  times  over  a  single  multistreamed  SCTP  association  (a.k.a.  SCTP)  vs. 
N TCP connections (N=1, 2, 4, 6, 8, 10; a.k.a. N-TCP) for the 64Kbps, 128Kbps and
1Mbps  bandwidth  scenarios.  Results  for  256Kbps  bandwidth  scenario  can  be  found  in 
[Natarajan  2008d].  Note  that  each  embedded  object  is  transmitted  on  a  different  TCP 
connection  in  10-TCP,  and  employing  more  TCP  connections  is  unnecessary.  The 
values  in  Figure  2.9  are  averaged  over  40  runs  (up  to  60  runs  for  the  10%  loss  case), 
and  plotted  with  95%  confidence  intervals. 

2.8.4.1 During No Congestion

Evaluations  with  0%  loss  (Figure  2.9)  help  understand  the  behavior  of 
multiple  TCPs  during  congestion.  As  mentioned  earlier,  the  initial  cwnds  of  both  TCP 
and  SCTP  are  similar  —  4MSS.  Since  there  is  no  loss,  both  transports  employ  slow 
start  during  the  entire  page  download.  This  equivalent  behavior  results  in  similar 
throughputs  between  SCTP  and  1-TCP  in  64Kbps  and  128Kbps  bandwidths.  Recall 
from  Section  2.7.2  that  the  packet-counting  FreeBSD  6.1  TCP  sender  increases  its 
cwnd more aggressively than an SCTP sender. As the available bandwidth increases
(256Kbps, 1Mbps), this difference in cwnd growth allows 1-TCP to slightly
outperform SCTP [Natarajan 2008d].

As mentioned in Section 2.8.1, N-TCP’s aggressive sending rate can
increase an application’s throughput by up to N times during slow start. Therefore, as
the number of TCP senders increases, we expected multiple TCPs to outperform both
1-TCP and SCTP. Surprisingly, the results indicate that multiple TCPs perform similarly to
1-TCP at 1Mbps and 256Kbps bandwidths [Natarajan 2008d]. As bandwidth
decreases, multiple TCPs perform similarly to or worse (!) than both 1-TCP and SCTP.
Further investigation revealed the following reasons.

2.8.4.1.1 Throughput Limited by Bottleneck Bandwidth

Low  bandwidth  pipes  can  transmit  only  a  few  packets  per  second.  For 
example,  a  64Kbps  bottleneck  cannot  transmit  more  than  ~5.3  1500byte  PDUs  per 
second  or  roughly  1  PDU  per  200ms  RTT.  A  single  TCP  sender’s  initial  cwnd  allows 
the  server  to  transmit  4MSS  bytes  of  pipelined  responses  back-to-back,  causing  a  low 
bandwidth  pipe  (64Kbps,  128Kbps,  and  256Kbps)  to  be  fully  utilized  during  the  entire 
RTT.  More  data  transmitted  during  this  RTT  cannot  be  forwarded,  and  gets  queued  at 
the  bottleneck  router.  Therefore,  data  transmitted  by  N>2  TCP  senders  do  not 
contribute  to  reducing  page  download  times,  and  N-TCPs  perform  similar  to  1-TCP  in 
64Kbps  (N=10),  128Kbps  (N=8,  10),  and  256Kbps  (N>2)  bandwidths  [Natarajan 
2008d].  The  1Mbps  bottleneck  is  completely  utilized  by  the  initial  cwnd  of  N=4  TCP 
senders (~16 1500byte PDUs per RTT). Therefore, 2≤N≤4 TCP senders slightly
improve page download times when compared to 1-TCP, and N>4 TCP senders do not
further reduce page download times.

As the propagation delay and RTT increase, the bottleneck router forwards
more packets per RTT. For example, the 1Mbps pipe can transmit ~53 PDUs per RTT
in the 650ms scenario vs. ~16 PDUs per RTT in the 200ms scenario. Consequently,
more TCP senders help fully utilize the 1Mbps pipe at 650ms RTT, and N-TCPs
decrease page download times [Natarajan 2008d]. However, similar to the 200ms RTT
scenario, lower bandwidths limit HTTP throughput, and N-TCPs perform similarly to
1-TCP in the 350ms and 650ms RTTs [Natarajan 2008d].

To summarize, HTTP throughput improvement is limited by the available
bandwidth in a low bandwidth last hop. As bandwidth decreases, fewer TCP senders
will fully utilize the available bandwidth, and additional TCP senders just increase the
queuing delay and decrease throughput.
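
The per-second and per-RTT packet counts used in this subsection follow from simple arithmetic, sketched below. The sketch assumes 1Kbps = 1000 bits/s and full-sized 1500byte PDUs; the function names are illustrative, and the figures quoted in the text are rounded approximations of these values.

```python
def pdus_per_second(bandwidth_bps: float, pdu_bytes: int = 1500) -> float:
    """Number of full-sized PDUs a link can serialize in one second."""
    return bandwidth_bps / (pdu_bytes * 8)


def pdus_per_rtt(bandwidth_bps: float, rtt_s: float, pdu_bytes: int = 1500) -> float:
    """Number of full-sized PDUs a link can serialize during one RTT."""
    return pdus_per_second(bandwidth_bps, pdu_bytes) * rtt_s


# 64Kbps last hop: about 5.3 PDUs per second, about 1 PDU per 200ms RTT.
low_bw_per_second = pdus_per_second(64_000)
low_bw_per_rtt = pdus_per_rtt(64_000, 0.200)

# 1Mbps last hop: about 16 to 17 PDUs per 200ms RTT.
high_bw_per_rtt = pdus_per_rtt(1_000_000, 0.200)
```

Comparing these values with a sender's 4MSS initial cwnd shows why one sender already saturates the 64Kbps pipe, while roughly four senders are needed to fill the 1Mbps pipe at 200ms RTT.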

2.8.4.1.2  Queuing  Delay  at  the  Bottleneck 

Figure 2.10 shows the mean number of timeout expirations on data at the
server for the 64Kbps, 128Kbps and 1Mbps bandwidth scenarios. Note that the values
plotted are the mean timeouts per HTTP transfer. When N>1 TCP senders are
employed for the HTTP transfer, the plotted values denote the sum of timeouts across
all N senders. We first focus on the values at 0% loss. Surprisingly, except at 1Mbps,
some TCP sender(s) in the other bandwidth scenarios undergo timeout recoveries.
Since no packets were lost, these timeouts must be spurious, and are due to the
following.

Figure 2.10: RTO Expirations on Data at Server (Object Size = 5K). Panels (a) 64Kbps.200ms, (b) 128Kbps.200ms, (c) 1Mbps.200ms; x-axis: Loss Rate

During connection establishment, a FreeBSD TCP sender estimates the
RTT, and calculates the retransmission timeout value (RTO) [FreeBSD, RFC2988].
For a 200ms RTT, the calculated RTO equals the recommended minimum of 1 second
[RFC2988]. Connection establishment is soon followed by data transfer from the server. Lower
bandwidth translates to higher transmission and queuing delays. In a 64Kbps pipe, the
transmission of one 1500byte PDU takes ~186ms, and a queue of ~5 such PDUs
gradually increases the queuing delay and the RTT to more than 1 second. When
outstanding data remains unacknowledged for more than the 1 second RTO, the TCP
sender(s) (wrongly) assume data loss, and spuriously time out and retransmit
unacknowledged data.
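
The arithmetic behind these spurious timeouts can be sketched as follows. The model is deliberately simple and is an assumption of this example: the RTT seen by a PDU is taken to be the base RTT plus the time needed to serialize the PDUs queued at the bottleneck (including the PDU itself), with 1Kbps = 1000 bits/s. Function names are illustrative.

```python
def transmission_delay_s(pdu_bytes: int, bandwidth_bps: float) -> float:
    """Time to serialize one PDU onto the bottleneck link."""
    return pdu_bytes * 8 / bandwidth_bps


def rtt_with_queue_s(base_rtt_s: float, queued_pdus: int,
                     bandwidth_bps: float, pdu_bytes: int = 1500) -> float:
    """Approximate RTT when queued_pdus full-sized PDUs must be serialized first."""
    return base_rtt_s + queued_pdus * transmission_delay_s(pdu_bytes, bandwidth_bps)


def is_spurious_timeout(rtt_s: float, rto_s: float = 1.0) -> bool:
    """With no loss, a timeout fires spuriously once the RTT exceeds the RTO."""
    return rtt_s > rto_s


# 64Kbps, 200ms base RTT, 5 queued PDUs: RTT grows past the 1 second RTO.
slow_link_rtt = rtt_with_queue_s(0.200, 5, 64_000)

# 1Mbps, same queue: RTT stays well below the RTO.
fast_link_rtt = rtt_with_queue_s(0.200, 5, 1_000_000)
```

Under these assumptions a queue of only five PDUs is enough to trigger a spurious timeout at 64Kbps, whereas the same queue adds only tens of milliseconds at 1Mbps.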

As the number of TCP senders increases, more packets arrive at the
bottleneck, and the increased queuing delay triggers spurious timeouts at a greater
number of TCP senders. Of the 4 bandwidth scenarios considered, the 1Mbps transfers
experience the smallest queuing delay, and do not suffer from spurious timeouts. As the
bottleneck bandwidth decreases, queuing delay increases. Therefore HTTP transfers
over smaller bandwidths experience more spurious timeouts.

A spurious timeout is followed by unnecessary retransmissions and cwnd
reduction. If the TCP sender has more data pending transmission, spurious timeouts
delay new data transmission, and increase page download times (N=2, 4, 6, 8 TCP in
64Kbps, and N=4, 6 TCP in 128Kbps). As the number of TCP connections increases,
fewer HTTP responses are transmitted per connection. For example, each HTTP
response is transmitted on a different connection in 10-TCP. Though the number of
spurious timeouts (and unnecessary retransmissions) is highest in 10-TCP, the TCP
receiver delivers the first copy of data to the HTTP client, and discards the spuriously
retransmitted copies. Therefore, 10-TCP’s page download times are unaffected by the
spurious timeouts. Nonetheless, spurious timeouts cause wasteful retransmissions that
compete with other flows for the already scarce available bandwidth.

As the propagation delay increases, the RTO calculated during connection
establishment is increased (> 1 second). Since transmission and queuing delays remain
unaffected, they impact the RTT less at higher propagation delays. Consequently,
spurious timeouts slightly decrease at 350ms and 650ms RTTs, but still remain
significant at lower bandwidths, and increase page download times [Natarajan 2008d].
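
The initial RTO values implied above can be reproduced from the rules in [RFC2988], under two simplifying assumptions made here for illustration: the first RTT sample R equals the propagation RTT (no queuing during the handshake), and the clock granularity term is ignored. The FreeBSD implementation may differ in detail.

```latex
\begin{align*}
SRTT &= R, \qquad RTTVAR = R/2 \\
RTO  &= \max\bigl(1\,\text{s},\; SRTT + 4 \cdot RTTVAR\bigr)
      = \max\bigl(1\,\text{s},\; 3R\bigr)
\end{align*}
% R = 200 ms: 3R = 0.60 s, so RTO = 1 s (the recommended minimum)
% R = 350 ms: 3R = 1.05 s
% R = 650 ms: 3R = 1.95 s
```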

To  summarize,  the  aggressive  sending  rate  of  multiple  TCP  senders  during 
slow  start  does  NOT  necessarily  translate  to  improved  HTTP  throughput  in  low 
bandwidth  last  hops.  Bursty  data  transmission  from  multiple  TCP  senders  increases 
queuing  delay  causing  spurious  timeouts.  The  unnecessary  retransmissions  following 
spurious  timeouts  (i)  compete  for  the  already  scarce  available  bandwidth,  and  (ii) 
adversely  impact  HTTP  throughput  when  compared  to  1-TCP  or  SCTP.  The 
throughput  degradation  is  more  noticeable  as  the  bottleneck  bandwidth  decreases. 

2.8.4.2  During  Congestion 

Though  SCTP  and  TCP  congestion  control  are  similar,  minor  differences 
such  as  SCTP’s  byte  counting  and  more  accurate  gap-ack  information  improve  SCTP’s 
loss  recovery  and  throughput  (Section  2.7.2).  As  the  loss  rate  increases,  SCTP’s  better 
congestion  control  offsets  FreeBSD  TCP’s  extra  ack  advantage  during  no  loss,  and 
SCTP  outperforms  1-TCP. 

Recall  from  Section  2.8.1  that  N-TCPs’  (N>1)  aggressive  sending  rate 
during  congestion  avoidance  can,  in  theory,  increase  throughput  by  more  than  N  times. 
Therefore,  we  expected  multiple  TCPs  to  outperform  both  1-TCP  and  SCTP.  On  the 
contrary,  multiple  TCP  connections  worsen  HTTP  page  download  times,  and  the 
degradation becomes more pronounced as loss rate increases. This observation is true
for all 4 bandwidth scenarios studied. Further investigation revealed the following
reasons.

2.8.4.2.1  Increased  Number  of  Timeout  Recoveries  at  the  Server 

For  every  loss  rate,  the  mean  number  of  timeout  expirations  at  the  server 
increases as the number of TCP senders increases (Figure 2.10). Section 2.8.4.1.2
discussed  how  increased  queuing  delays  cause  spurious  timeouts  even  at  0%  loss.  Such 
spurious  timeouts,  observed  during  lossy  conditions  as  well,  delay  new  data 
transmission,  thus  worsening  HTTP  page  download  times. 

Recall  that  the  1Mbps  transfers  did  not  suffer  spurious  timeouts  (0%  loss 
in Figure 2.10c). However, multiple TCPs still amplify timeout expirations in 1Mbps
transfers.  Further  investigation  revealed  that  multiple  TCPs  reduce  ack  information 
which  is  crucial  for  fast  retransmit-based  loss  recoveries. 

Figure  2.11  shows  the  average  number  of  bytes  retransmitted  during  TCP 
SACK  recovery  episodes  (fast  recovery)  in  the  64Kbps  and  1Mbps  transfers, 
respectively.  (Results  for  the  other  intermediate  bandwidths  were  similar  and  hence  not 
shown.) Each value represents retransmissions from the server to client, and does not
include retransmissions after timeout expirations. Similar to values in Figure 2.10, each
value  in  Figure  2.11  represents  the  average  bytes  retransmitted  per  HTTP  transfer,  i.e., 
bytes  retransmitted  by  all  N  TCP  senders. 

During  0%  loss,  data  is  always  received  in-order  at  the  client.  The  acks 
from  client  to  server  contain  no  SACK  blocks,  and  the  server  does  not  undergo  SACK 
recoveries  (Figure  2.11).  During  loss,  data  received  out-of-order  at  the  client  triggers 
dupacks  containing  SACK  blocks.  On  receiving  3  dupacks,  a  TCP  sender  enters  SACK 
recovery and fast retransmits missing data [FreeBSD]. Higher loss rates trigger more
SACK recovery episodes, and increase retransmissions during SACK recoveries
(Figure 2.11). However, for a given loss rate, the retransmissions decrease as the
number of TCP connections increases. That is, for the same fraction of lost HTTP data
(same loss rate), loss recoveries based on fast retransmits decrease as the number of
TCP senders increases.


Figure  2.11:  Fast  Retransmits  during  SACK  Recovery  (Object  Size  =  5K) 

Note  that  loss  recovery  based  on  fast  retransmit  relies  on  dupack 
information from the client. As the number of TCP connections increases, data
transmitted per connection decreases, thus reducing the number of potential dupacks
arriving  at  each  TCP  sender.  Ack  losses  on  the  reverse  path  further  decrease  the 
number  of  dupacks  received.  While  the  TCP  senders  implement  Limited  Transmit 
[RFC3042]  to  increase  dupack  information,  the  applicability  of  Limited  Transmit 
diminishes  as  the  amount  of  data  transmitted  per  TCP  connection  decreases. 

In  summary,  increasing  the  number  of  TCP  connections  decreases  per 
connection  dupack  information.  Fewer  dupacks  reduce  the  chances  of  fast  retransmit- 
based  loss  recovery,  resulting  in  each  sender  performing  more  timeout-based  loss 
recoveries. 
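
A rough count shows how quickly per-connection dupack opportunities shrink. The sketch below is a simplification and not a model of FreeBSD TCP: it assumes an MSS of 1460 bytes, 5K = 5120 bytes, objects spread evenly over connections, a single lost segment, and no HTTP headers, cwnd limits, Limited Transmit or ack losses. A lost segment is counted as a fast retransmit candidate only if at least three later segments exist on the same connection to generate dupacks.

```python
import math


def segments_per_connection(n_objects: int, object_bytes: int,
                            n_connections: int, mss: int = 1460) -> int:
    """Data segments carried by the busiest connection (assumed MSS)."""
    objects_on_connection = math.ceil(n_objects / n_connections)
    return math.ceil(objects_on_connection * object_bytes / mss)


def fast_retransmit_candidates(segments: int, dupack_threshold: int = 3) -> int:
    """Segment positions followed by at least dupack_threshold later segments."""
    return max(0, segments - dupack_threshold)


# Ten 5K objects over one connection vs. over ten connections.
one_tcp_segments = segments_per_connection(10, 5 * 1024, 1)
ten_tcp_segments = segments_per_connection(10, 5 * 1024, 10)
```

Under these assumptions a single connection carries 36 segments, 33 of which could be recovered by fast retransmit if lost, while each connection in 10-TCP carries 4 segments, only the first of which is followed by enough data to produce three dupacks.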

2.8.4.2.2  Increased  Connection  Establishment  Latency 

The  in-house  HTTP  client,  which  closely  resembles  Firefox’s  transaction 
model,  first  opens  a  single  TCP  connection  to  the  server,  and  retrieves  and  parses 
index.html.  Then,  the  client  establishes  more  TCP  connection(s)  for  requesting 
embedded  objects  in  a  pipelined  fashion.  Note  that  HTTP  requests  can  be  transmitted 
over  these  connections  only  after  successful  connection  establishment,  i.e.,  only  when 
the  TCP  client  has  successfully  sent  a  SYN  and  received  a  SYN-ACK.  Any  delay  in 
connection  establishment  due  to  SYN  or  SYN-ACK  loss  delays  HTTP  request  (and 
response)  transmission. 

Figure  2.12  shows  the  average  number  of  SYN  or  SYN-ACK 
retransmissions  for  the  64Kbps  and  1Mbps  transfers,  respectively.  (Results  for  the 
other  intermediate  bandwidths  were  similar  and  hence  not  shown.)  When  multiple  TCP 
connections are employed for an HTTP transfer, the number of SYN and SYN-ACK
packets increases, and the probability of a SYN or SYN-ACK loss increases. Therefore,
the number of SYN or SYN-ACK retransmissions tends to increase as the number of
TCP connections increases.


A  SYN  or  SYN-ACK  loss  can  be  recovered  only  after  the  recommended 
initial  RTO  value  of  3  seconds  [RFC2988],  and  increases  the  HTTP  page  download 
time  by  at  least  3  seconds.  Consequently,  losses  during  connection  establishment 
degrade  HTTP  throughput  more  when  the  time  taken  to  download  HTTP  responses 
(after  connection  establishment)  is  smaller  compared  to  the  initial  RTO  value. 
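
The growth in connection establishment risk can be estimated with a simple probability sketch. It assumes, for illustration, that SYN and SYN-ACK losses are independent and occur at the same rate in both directions; the loss model actually used in the emulations may differ, and the function name is hypothetical.

```python
def p_handshake_loss(n_handshakes: int, loss_rate: float) -> float:
    """Probability that at least one SYN or SYN-ACK is lost across
    n_handshakes connection establishments (independent losses assumed)."""
    packets = 2 * n_handshakes  # one SYN and one SYN-ACK per handshake
    return 1.0 - (1.0 - loss_rate) ** packets


# At 10% loss: one handshake vs. nine additional handshakes.
single = p_handshake_loss(1, 0.10)
nine = p_handshake_loss(9, 0.10)
```

Under these assumptions, one handshake at 10% loss is delayed with probability of about 0.19, whereas opening nine connections pushes the probability that at least one of them incurs the 3 second initial RTO to about 0.85.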


Figure 2.12: SYN or SYN-ACK Retransmissions (Object Size = 5K). Panel (b): 1Mbps.200ms; x-axis: Loss Rate

Figure 2.13: HTTP Throughput (Object Size = 10K). Panels (b): 128Kbps.200ms, (c): 1Mbps.200ms; x-axis: Loss Rate

Figure 2.14: RTO Expirations on Data at Server (1Mbps.200ms; Object Size = 10K)


2.8.4.3  Impact  of  Varying  Object  Sizes 

To investigate how object size impacts HTTP throughput, we repeated the
emulations with larger (10K) embedded objects. The results are shown in Figure 2.13.
Comparing Figures 2.9 and 2.13, we see that the trends between 1-TCP and multiple
TCPs remain similar between the 5K and 10K transfers for all bandwidth scenarios
except 1Mbps. In 1Mbps, N-TCPs perform better than 1-TCP, and the improvement is
more pronounced at higher loss rates.

Figure 2.14 shows the server’s mean timeout recoveries for the 10K
transfers in the 1Mbps scenario. Comparing values in Figure 2.14 with Figure 2.10c,
we see that the 10K transfers suffered fewer timeout recoveries per transfer time unit
than 5K transfers. In the 10K transfers, each TCP sender transfers more data and
receives more dupacks per TCP connection than the 5K transfers (Section 2.8.4.2.1).
The increased flow of acks in the 10K transfers triggered more fast retransmissions in
SACK recovery episodes, and fewer timeout-based recoveries compared to the 5K
transfers (Figure 2.14). Consequently, N-TCPs improved HTTP throughput in the 10K
transfers. However, as the last hop bandwidth decreases, the negative consequences of
multiple TCP senders, such as increased queuing delay and connection establishment
latency, increase the page download times, and N-TCPs perform similarly to or worse
than 1-TCP. More importantly, note that SCTP’s enhanced loss recovery helps it
outperform N-TCPs even in the 10K transfers.

To  summarize,  object  size  affects  HTTP  throughput  over  multiple  TCP 
connections.  Smaller  objects  reduce  dupack  information  per  TCP  connection  and 
degrade HTTP throughput more than bigger objects. However, at lower bandwidths the
impact of object size diminishes, and the negative consequences of multiple TCP
senders dominate and bring down HTTP throughput.

2.9  Conclusion,  Ongoing  and  Future  Work 

We examined HOL blocking and its effects on web response times in
HTTP over TCP. We proposed a multistreamed web transport such as SCTP to
alleviate HOL blocking, and designed and implemented HTTP over SCTP in the
Apache web server and Firefox browser. Emulation evaluations demonstrate that
HTTP over TCP suffers from exacerbated HOL blocking, which worsens response
times in the high latency and lossy browsing conditions found in the developing world.
In contrast, SCTP streams eliminate inter-object HOL blocking and improve web
response times. The improvements are more visually perceivable in the high latency and
lossy end-to-end paths found in the developing world.

The  current  workaround  to  improve  an  end  user’s  perceived  WWW 
performance  is  to  download  an  HTTP  transfer  over  multiple  TCP  connections.  While 
we  expected  multiple  TCP  connections  to  improve  HTTP  throughput,  emulation 
results  showed  that  the  competing  and  bursty  nature  of  multiple  TCP  senders  degraded 
HTTP performance especially in low bandwidth last hops. In such browsing conditions,
a single multistreamed SCTP association not only eliminates HOL blocking, but also
boosts  throughput  compared  to  multiple  TCP  connections. 

Our  body  of  work  in  HTTP  over  SCTP  has  stimulated  significant  interest 
in  the  area.  The  Protocol  Engineering  Lab  has  also  secured  funding  through  Cisco 
Systems’  University  Research  Program  for  some  of  the  ongoing  activity  discussed 
below. 

2.9.1  IETF  Internet  Draft 

We  have  proposed  an  Internet  Draft  (ID)  to  standardize  our  HTTP  over 
SCTP streams design [Natarajan 2008f]. This ID was presented at the 73rd IETF
Meeting held in Minneapolis in November 2008. The objectives of this ID are: (i) to
highlight  SCTP  services  that  better  match  the  needs  of  HTTP-based  applications,  (ii)  to 
propose  the  HTTP  over  SCTP  streams  design,  and  (iii)  to  share  important  lessons 
learnt  while  implementing  HTTP  over  SCTP  in  Apache  and  Firefox. 

2.9.2  SCTP-enabled  Apache  and  Firefox 

Jonathan  Leighton  is  heading  this  on-going  effort  to  integrate  our  HTTP 
over  SCTP  design  and  implementation  into  the  Firefox  distribution  from  mozilla.org, 
and  the  Apache  distribution  from  apache.org.  The  current  activity  is  focused  on 
integrating  SCTP  related  APIs  in  the  Netscape  Portable  Runtime  (NSPR)  API  and  the 
Apache  Portable  Runtime  (APR)  API,  which  offer  platform  independent  network 
implementations  to  Firefox  and  Apache,  respectively.  Subsequent  work  will  focus  on 
modifying  Firefox  and  Apache  to  take  advantage  of  these  SCTP  related  APIs,  and 
enabling appropriate SCTP related compile options for various platforms and SCTP
implementations. 

2.9.3  Minimizing  Resource  Requirements 

As mentioned in Section 2.8, today’s web browsers reduce HOL blocking
in HTTP over TCP by downloading an HTTP transfer over multiple TCP connections.
In contrast, a browser over SCTP eliminates HOL blocking by simply increasing the
number of streams in the SCTP association. Each TCP connection or pair of SCTP
streams (inbound/outbound) increases the processing and resource overhead at the web
server or proxy. However, the resources required to support a new pair of SCTP
streams are much lower than those required for a new TCP connection. For example, on FreeBSD
each inbound or outbound SCTP stream requires an additional 28 or 32 bytes,
respectively, in the SCTP Protocol Control Block (PCB), while a new TCP PCB
requires ~700 bytes [FreeBSD]. The difference in TCP vs. SCTP resource
requirements increases with the number of clients, and can be significant at a web
server farm handling thousands of clients. This difference can also be significant at
intermediate entities such as web caches that serve many web clients and/or other
caches [Squid].
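
The per-stream and per-connection figures above can be turned into a rough, back-of-the-envelope comparison. The sketch below uses only the FreeBSD numbers quoted in the text; the client count and the number of parallel transfers per client are hypothetical, and the fixed cost of the SCTP association's own PCB is deliberately left out because its size is not stated here.

```python
# Per-unit costs quoted in the text for FreeBSD (the TCP figure is approximate).
SCTP_INBOUND_STREAM_BYTES = 28
SCTP_OUTBOUND_STREAM_BYTES = 32
TCP_PCB_BYTES = 700


def extra_tcp_bytes(clients, parallel):
    """Incremental PCB memory when each client opens `parallel` TCP connections."""
    return clients * parallel * TCP_PCB_BYTES


def extra_sctp_stream_bytes(clients, parallel):
    """Incremental memory when each client uses `parallel` SCTP stream pairs."""
    per_pair = SCTP_INBOUND_STREAM_BYTES + SCTP_OUTBOUND_STREAM_BYTES
    return clients * parallel * per_pair


# Hypothetical load: 10,000 clients, 6 parallel transfers each.
print(extra_tcp_bytes(10_000, 6))          # 42000000
print(extra_sctp_stream_bytes(10_000, 6))  # 3600000
```

Under these assumed numbers the stream-based approach needs roughly a twelfth of the incremental memory, which is simply the 700:60 ratio of the per-unit costs.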

The absolute difference in TCP vs. SCTP resource requirements depends
not only on the respective protocol implementations but also on how well optimized the
implementations are. While the TCP stack has been optimized over the past two
decades, the SCTP stack is relatively new, and the SCTP reference implementation on
FreeBSD can be optimized further. For example, Randall Stewart, the designer of
FreeBSD SCTP, estimates that the FreeBSD SCTP PCB size can be reduced by ~600
bytes. Evaluating TCP vs. SCTP resource usage makes more sense after such
optimizations are in place.

2.9.4  Impact  on  Developing  Regions 

While  HTTP  over  SCTP  promises  better  response  times  in  high 
propagation  delay/low  bandwidth/lossy  browsing  conditions,  it  is  impractical  to  expect 
all  web  servers  to  provide  web  over  SCTP  in  the  immediate  future,  without  which 
SCTP’s  benefits  cannot  be  leveraged.  To  address  this  issue,  we  propose  a  realistic,  low 
cost,  gateway-based  solution  that  translates  HTTP  over  TCP  to  HTTP  over  SCTP 
streams  for  easier  and  localized  deployment.  The  solution  assumes  that  the  web 
browser  is  capable  of  HTTP  over  SCTP,  similar  to  the  SCTP-enabled,  freely  available 
Firefox browser used in our emulations. The gateway is physically positioned between
the server and client such that the gateway talks SCTP to clients over the last hop
with high propagation delay and/or low bandwidth, and talks TCP to web servers in the
outside world. For the architecture shown in Figure 2.2, the gateway is positioned
between the VSAT ground station (on the left) and the Internet cloud. We believe that
the  “proxy”  configuration  in  the  SCTP-enabled  Apache  server  is  a  good  starting  point 
to  achieve  the  gateway  functionality  at  minimal  monetary  cost  [Apache]. 

At  a  minimum,  a  gateway  solution  should  provide  faster  page  downloads 
than  HTTP  over  TCP.  This  solution  can  be  extended  to  further  enhance  pipelined 
objects’  response  times.  For  example,  the  gateway  could  use  batch  image  conversion 
software  [Gimp]  to  convert  embedded  non-progressive  JPEG  or  PNG  images  to  their 
corresponding  progressive  versions  before  forwarding  them  to  the  clients.  Image 
conversion  at  the  gateway  takes  on  the  order  of  milliseconds  per  image,  but  can 
improve  a  user’s  response  times  on  the  order  of  seconds. 

2.10 Related Work


Significant  interest  exists  for  designing  new  transport  and  session  protocols 
that  better  suit  the  needs  of  HTTP-based  client-server  applications  than  TCP.  As 
mentioned  earlier,  several  experts  agree  that  the  best  transport  scheme  for  HTTP 
would  be  one  that  supports  datagrams,  provides  TCP  compatible  congestion  control  on 
the entire datagram flow, and facilitates concurrency in GET requests [Gettys 2002].
WebMUX  [Gettys  1998]  was  one  such  session  management  protocol  that  was  a 
product  of  the  (now  historic)  HTTP-NG  working  group  [HTTP-NG].  WebMUX 
proposed  using  a  reliable  transport  protocol  to  provide  web  transfers  with  streams  for 
transmitting  independent  objects.  However,  the  WebMUX  effort  did  not  mature. 

[Ford  2007]  proposes  the  use  of  Structured  Stream  Transport  (SST)  for 
web transfers. SST was proposed after [Natarajan 2006a] and functions similarly to
SCTP streams. SST extends TCP to provide multiple streams over a TCP-friendly
transport connection. Simulation-based evaluations in [Ford 2007] show that SST
provides page download times similar to TCP’s. The primary contribution of a
multistreamed web transport is the reduction in HOL blocking, which is the focus of
our work. Using real implementations, we show that reduced HOL blocking in HTTP
over  SCTP  results  in  visually  perceivable  improvements  to  individual  objects’  response 
times  in  browsing  conditions  typical  of  developing  regions.  Also,  we  note  that  SCTP  is 
a  standardized  IETF  protocol  with  many  fine-tuned  kernel  space  implementations, 
while  SST  is  a  research  protocol  yet  to  be  standardized. 

Apart  from  new  session  and  transport  protocols,  other  sender-side 
techniques focus on reducing the adverse effects of the current workaround to reduce
HOL blocking, namely parallel TCP connections. The Congestion Manager (CM) [RFC3124]
is a shim layer between the transport and network layers which aggregates congestion
control at the end host, thereby enforcing a fair sending rate when an HTTP transfer
employs  multiple  TCP  connections.  “TCP  Session”  [Padmanabhan  1998]  proposes 
integrated  loss  recovery  across  multiple  TCP  connections  to  the  same  web  client  (these 
multiple  TCP  connections  are  together  referred  to  as  a  TCP  session).  All  TCP 
connections  within  a  session  are  assumed  to  share  the  transmission  path  to  the  web 
client.  A  Session  Control  Block  (SCB)  is  maintained  at  the  sender  to  store  information 
about  the  shared  path  such  as  its  cwnd  and  RTT  estimate.  While  CM  and  TCP  Session 
reduce  the  adverse  effects  of  parallel  TCP  connections  on  the  network  and  the 
application,  these  solutions  still  require  a  web  browser  to  open  multiple  TCP 
connections,  thereby  increasing  the  web  server’s  resource  requirements. 

Content  Delivery  Networks  (CDNs)  replicate  web  content  across 
geographically  distributed  servers,  and  reduce  response  times  for  web  users  by 
redirecting  requests  to  a  server  closest  to  the  client.  [Krishnamurthy  2001]  confirms 
that  CDNs  reduce  average  web  response  times  for  web  users  along  USA’s  east  coast 
for  static  content.  Unfortunately,  little  research  exists  on  the  prevalence  of  CDNs  for 
content  providers  and  web  users  outside  of  developed  nations.  Also,  CDNs  cannot 
lessen  web  response  times  when  latency  is  due  to  (i)  propagation  delay  and/or  low 
bandwidth  last  hop,  as  is  the  case  in  developing  regions,  or  (ii)  sub-optimal  traffic 
routing  that  increases  end-to-end  path  RTTs  [Baggaley  2007]. 

Chapter 3

NON-RENEGABLE SACKS (NR-SACKS) FOR SCTP

This  chapter  discusses  a  fundamentally  new  transport  layer 
acknowledgment  mechanism  called  Non-Renegable  Selective  Acks  (NR-SACKs). 
Sections  3.1  and  3.2  introduce  reneging  in  current  transport  protocol  implementations 
and  the  inefficiencies  with  TCP  and  SCTP  SACK  mechanisms  when  received  data  is 
non-renegable.  Section  3.3  proposes  NR-SACKs  for  SCTP,  and  discusses  the  specifics 
of  SCTP’s  NR-SACK  chunk.  Sections  3.4  and  3.5  discuss  simulation  preliminaries  and 
present  results  comparing  SACKs  vs.  NR-SACKs  in  both  SCTP  and  CMT.  Finally, 
Section  3.6  concludes  and  presents  ongoing  and  future  work. 

3.1  Introduction 

Reliable  transport  protocols  such  as  TCP  and  SCTP  employ  two  kinds  of 
data  acknowledgment  mechanisms:  (i)  cumulative  acks  (cum-acks)  indicate  data  that 
has  been  received  in-sequence,  and  (ii)  selective  acknowledgments  (SACKs)  indicate 
data  that  has  been  received  out-of-order.  In  both  TCP  and  SCTP,  while  cum-acked 
data  is  the  receiver’s  responsibility,  SACKed  data  is  not,  and  SACK  information  is 
advisory  [RFC3517,  RFC4960].  While  SACKs  notify  a  sender  about  the  reception  of 
specific out-of-order TPDUs, the receiver is permitted to later discard the TPDUs.
Discarding data that has been previously SACKed is known as reneging. Though
reneging is a possibility, the conditions under which current transport layer and/or
operating system implementations renege, and the frequency of these conditions
occurring in practice (if any), are unknown and need further investigation.

Data  that  has  been  delivered  to  the  application,  by  definition,  is  non- 
renegable  by  the  transport  receiver.  Unlike  TCP  which  never  delivers  out-of-order  data 
to  the  application,  SCTP’s  multistreaming  and  unordered  data  delivery  services 
(Chapter  1)  result  in  out-of-order  data  being  delivered  to  the  application  and  thus 
becoming  non-renegable.  Interestingly,  TCP  and  SCTP  implementations  can  be 
configured such that the receiver is not allowed to and therefore never reneges on out-
of-order data (details in Section 3.2). In these configurations, even non-deliverable out-
of-order data becomes non-renegable.

The  current  TCP  or  SCTP  SACK  mechanism  does  not  differentiate 
between  out-of-order  data  that  “has  been  delivered  to  the  application  and/or  is  non- 
renegable”  vs.  data  that  “has  not  yet  been  delivered  to  the  application  and  is  renegable”. 
In  this  work,  we  introduce  a  fundamentally  new  third  acknowledgment  mechanism 
called  Non-Renegable  Selective  Acknowledgments  (NR-SACKs)  that  enable  a 
transport  receiver  to  explicitly  convey  non-renegable  information  to  the  sender  on  some 
or  all  out-of-order  TPDUs.  While  this  work  introduces  NR-SACKs  for  SCTP,  the  NR- 
SACKs  idea  can  be  applied  to  any  reliable  transport  protocol  that  uses  selective 
acknowledgments  and/or  permits  delivery  of  out-of-order  data,  or  where  a  receiver 
never  reneges  on  previously  acked  data. 

3.2  Problem  Description 

This  section  investigates  the  effect  of  SCTP’s  SACK  mechanism  in 
situations  when  out-of-order  data  is  non-renegable,  and  identifies  conditions  under 
which  SACKs  hurt  performance  in  an  SCTP  or  CMT  association. 

3.2.1 Background

The  SCTP  (or  TCP)  send  buffer,  or  the  sender-side  socket  buffer  (Figure 
3.1),  consists  of  two  kinds  of  data:  (i)  new  application  data  waiting  to  be  transmitted 
for  the  first  time,  and  (ii)  copies  of  data  that  have  been  transmitted  at  least  once  and  are 
waiting  to  be  cum-acked,  a.k.a.  the  retransmission  queue  (RtxQ).  Data  in  the  RtxQ  is 
the  transport  sender’s  responsibility  until  the  receiver  has  guaranteed  their  delivery  to 
the  receiving  application,  and/or  the  receiver  guarantees  not  to  renege  on  the  data. 


Figure 3.1: Transport Layer Send Buffer [Send Buffer = New Data + Retransmission Queue (RtxQ)]


In  traditional  in-order  data  delivery  service,  a  receiver  cum-acks  the  latest 
in-order  data.  Cum-acked  data  has  either  been  delivered  to  the  application  or  is  ready 
for  delivery.  In  either  case,  cum-acks  are  an  explicit  assurance  that  the  receiver  will  not 
renege  on  the  corresponding  data.  Upon  receiving  a  cum-ack,  the  sender  is  no  longer 
responsible,  and  removes  the  corresponding  data  from  the  RtxQ.  In  the  current  SACK 
mechanism,  cum-acks  are  the  only  means  to  convey  non-renegable  information;  all 
selectively  acked  (out-of-order)  data  are  by  default  renegable. 

As  discussed  in  Chapter  1,  SCTP’s  multistreaming  service  divides  an  end- 
to-end  association  into  independent  logical  data  streams.  Data  arriving  in-sequence 
within  a  stream  can  be  delivered  to  the  receiving  application  even  if  the  data  is  out-of- 
order  relative  to  the  association’s  overall  flow  of  data.  Also,  data  marked  for  unordered 
delivery  can  be  delivered  immediately  upon  reception,  regardless  of  the  data’s  position 
within the overall flow of data. Thus, SCTP’s data delivery services result in situations
where  out-of-order  data  is  delivered  to  the  application,  and  is  thus  non-renegable. 

Operating  systems  allow  configuration  of  transport  layer  implementations 
such  that  received  out-of-order  data  is  never  reneged.  For  example,  in  FreeBSD,  the 
net.inet.tcp.do_tcpdrain  or  net.inet.sctp.do_sctp_drain  sysctl  parameters  can  be 
configured  to  never  revoke  kernel  memory  allocated  to  TCP  or  SCTP  out-of-order 
data, such that non-deliverable out-of-order data is non-renegable. Thus, out-of-order
data  can  also  be  rendered  non-renegable  through  simple  user  configuration. 

In  the  following  discussions,  “non-renegable  out-of-order  data”  refers  to 
data  for  which  the  transport  receiver  takes  full  responsibility,  and  guarantees  not  to 
renege  either  because  (i)  the  data  has  been  delivered  (or  is  deliverable)  to  the 
application,  or  (ii)  the  receiving  system  (OS  and/or  transport  layer  implementation) 
guarantees  not  to  revoke  the  allocated  memory  until  after  the  data  is  delivered  to  the 
application.  With  the  current  SACK  mechanism,  non-renegable  out-of-order  data  is 
selectively  acked,  and  is  (wrongly)  deemed  renegable  by  the  transport  sender. 
Maintaining  copies  of  non-renegable  data  in  the  sender’s  RtxQ  is  unnecessary. 

3.2.2  Unordered  Data  Transfer  using  SACKs 

Using  a  timeline  diagram,  this  section  discusses  the  effects  of  SACKs  in 
transfers where all out-of-order data is non-renegable. The discussion is applicable to any
type  of  reliable  data  delivery  service  (in-order,  partial-order,  unordered)  where  all  out- 
of-order  data  is  non-renegable,  but  uses  the  simple  unordered  SCTP  data  transfer 
example  shown  in  Figure  3.2. 

In  this  example,  the  SCTP  send  buffer  denoted  by  the  rectangular  box  can 
hold  a  maximum  of  eight  TPDUs.  Each  SCTP  PDU  is  assigned  a  unique  Transmission 
Sequence Number (TSN). The timeline slice shown in Figure 3.2 picks up the data
transfer  at  a  point  when  the  sender’s  cwnd  C=8,  allowing  transmission  of  8  TPDUs 
(arbitrarily  numbered  with  TSNs  11-18).  Note  that  when  TSN  18  is  transmitted,  the 
RtxQ  grows  to  fill  the  entire  send  buffer. 

Figure 3.2: Unordered SCTP Data Transfer using SACKs


In this example, TSN 11 is presumed lost in the network. The other TSNs
are received out-of-order and immediately SACKed by the SCTP receiver. The SACKs
shown have the following format: (S)ACK: CumAckTSN; GapAckStart-GapAckEnd.
Each  gap-ack  start  and  gap-ack  end  value  is  relative  to  the  cum-ack  value,  and  together 
they  specify  a  block  of  received  TSNs. 
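
The relative encoding of gap-ack blocks can be illustrated with a short sketch. The function name and the tuple representation of a block are illustrative choices, not part of any SCTP implementation, and 32-bit TSN wraparound is ignored for simplicity.

```python
def gap_acked_tsns(cum_ack, gap_blocks):
    """Expand (start, end) offsets, relative to cum_ack, into absolute TSNs.

    Both ends of each block are inclusive. TSN wraparound is not handled.
    """
    tsns = []
    for start, end in gap_blocks:
        tsns.extend(range(cum_ack + start, cum_ack + end + 1))
    return tsns


# S:10;2-2 gap-acks TSN 12, and S:10;2-4 gap-acks TSNs 12 through 14.
print(gap_acked_tsns(10, [(2, 2)]))  # [12]
print(gap_acked_tsns(10, [(2, 4)]))  # [12, 13, 14]
```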


At  the  sender,  the  first  SACK  (S:10;2-2)  is  also  a  dupack  and  gap-acks 
TSN  12.  Though  data  corresponding  to  TSN  12  has  been  delivered  to  the  receiving 
application,  the  SACK  does  not  convey  the  non-renegable  nature  of  TSN  12,  requiring 
the  sender  to  continue  being  responsible  for  this  TSN.  Starting  from  the  time  this 
SACK  arrives  at  the  sender,  the  copy  of  TSN  12  in  the  sender’s  RtxQ  is  unnecessary. 
The  gap-ack  for  TSN  12  reduces  the  amount  of  outstanding  data  (O)  to  7  TPDUs. 
Since O<C, the sender could in theory transmit new data, but in practice cannot do so
since  the  completely  filled  send  buffer  blocks  the  sending  application  from  writing  new 
data  into  the  transport  layer.  We  call  this  situation  send  buffer  blocking.  Note  that  send 
buffer  blocking  prevents  the  sender  from  fully  utilizing  the  cwnd. 

The  second  and  third  dupacks  (S:10;2-3,  S:10;2-4)  increase  the  number  of 
unnecessary  TSNs  in  the  RtxQ,  and  send  buffer  blocking  continues  to  prevent  new  data 
transmission.  On  receipt  of  the  third  dupack,  the  sender  halves  the  cwnd  (C=4),  fast 
retransmits  TSN  11,  and  enters  fast  recovery.  Dupacks  received  during  fast  recovery 
further  increase  the  amount  of  unnecessary  data  in  the  RtxQ,  prolonging  inefficient 
RtxQ usage. Note that though these dupacks reduce outstanding data (O<C), send
buffer  blocking  prevents  new  data  transmission. 

The sender eventually exits fast recovery when the SACK for TSN 11’s
retransmission  (S:18)  arrives.  The  sender  removes  the  unnecessary  copies  of  TSNs  12- 
18  from  the  RtxQ,  and  concludes  the  current  instance  of  send  buffer  blocking.  Since 
send buffer blocking prevented the sender from fully utilizing the cwnd before, the new
cum-ack (S:18) does not increase the cwnd [RFC4960]. The application writes data
into  the  newly  available  send  buffer  space  and  the  sender  now  transmits  TSNs  19-22. 

Based  on  the  timeline  in  Figure  3.2,  the  following  observations  can  be 
made  regarding  transfers  with  non-renegable  out-of-order  data: 

•  The  unnecessary  copies  of  non-renegable  out-of-order  data  waste  kernel 
memory  (RtxQ).  The  amount  of  wasted  memory  is  a  function  of  flightsize 
(amount  of  data  “in  flight”)  during  a  loss  event;  a  larger  flightsize  wastes  more 
memory. 

•  When  the  RtxQ  grows  to  fill  the  entire  send  buffer,  send  buffer  blocking  ensues, 
which  can  degrade  throughput. 
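
The bookkeeping behind these observations can be replayed in a few lines. The following is a simplified sketch rather than a model of a real SCTP stack: it covers only the first three dupacks of the Figure 3.2 timeline (before fast retransmit), does not model cwnd changes, and uses a hypothetical function name.

```python
def replay_sacks(send_buffer_size, sent_tsns, gap_acked_in_order):
    """Track RtxQ state as plain SACKs gap-ack one TSN at a time.

    With SACKs, a gap-acked TSN stays in the RtxQ until it is cum-acked,
    so gap-acks reduce outstanding data without freeing send buffer space.
    Returns (tsn, outstanding, unnecessary_copies, free_slots) per SACK.
    """
    rtxq = set(sent_tsns)  # copies held until cum-acked
    gap_acked = set()
    rows = []
    for tsn in gap_acked_in_order:
        gap_acked.add(tsn)
        outstanding = len(rtxq - gap_acked)
        unnecessary = len(rtxq & gap_acked)
        free_slots = send_buffer_size - len(rtxq)
        rows.append((tsn, outstanding, unnecessary, free_slots))
    return rows


# Send buffer of 8 TPDUs, TSNs 11-18 sent, TSN 11 lost, TSNs 12-14 gap-acked.
for row in replay_sacks(8, range(11, 19), [12, 13, 14]):
    print(row)
# (12, 7, 1, 0)
# (13, 6, 2, 0)
# (14, 5, 3, 0)
```

Outstanding data falls with each SACK while the number of free send buffer slots stays at zero, which is the send buffer blocking described above.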

3.2.3  Implications  to  CMT 

As  discussed  in  Chapter  1,  CMT  is  an  experimental  SCTP  extension  that 
exploits  SCTP  multihoming  for  simultaneous  transfer  of  new  data  over  multiple 
independent  paths  [Iyengar  2006].  Similar  to  an  SCTP  sender,  the  CMT  sender  uses  a 
single  send  buffer  and  RtxQ  for  data  transfer.  However,  the  CMT  sender’s  total 
flightsize is the sum of flightsizes on each path. Since the amount of wasted kernel memory and
the probability of send buffer blocking increase as the transport sender’s flightsize
increases (previous subsection), we hypothesize that a CMT association is even more
likely  than  an  SCTP  association  to  suffer  from  the  inefficiencies  of  the  existing  SACK 
mechanism. 

3.3  Solution:  Non-renegable  Selective  Acks 

Non-Renegable  Selective  Acknowledgments  (NR-SACKs)  [Natarajan 
2008a, Natarajan 2008e] enable a receiver to explicitly convey non-renegable
information on out-of-order data. In SCTP, NR-SACKs provide the same information
as  SACKs  for  congestion  and  flow  control,  and  the  sender  is  expected  to  process  this 
information  identical  to  SACK  processing.  In  addition,  NR-SACKs  provide  the  added 
option  to  report  some  or  all  of  the  out-of-order  data  as  being  non-renegable. 

3.3.1  NR-SACK  Chunk  Details 

Before  sending/receiving  NR-SACKs,  the  endpoints  first  negotiate  NR- 
SACK  usage  during  association  establishment.  An  endpoint  supporting  the  NR-SACK 
extension  lists  the  NR-SACK  chunk  in  the  Supported  Extensions  Parameter  carried  in 
the  INIT  or  INIT-ACK  chunk  [RFC5061].  During  association  establishment,  if  both 
endpoints  support  the  NR-SACK  extension,  then  each  endpoint  acknowledges  received 
data  with  NR-SACK  chunks  instead  of  SACK  chunks. 

The  proposed  NR-SACK  chunk  for  SCTP  is  shown  in  Figure  3.3.  Since 
NR-SACKs  extend  SACK  functionality,  the  NR-SACK  chunk  has  several  fields 
identical  to  the  SACK  chunk:  the  Cumulative  TSN  Ack,  the  Advertised  Receiver 
Window  Credit,  Gap  Ack  Blocks,  and  Duplicate  TSNs.  These  fields  have  identical 
semantics  to  the  corresponding  fields  in  the  SACK  chunk  [RFC4960].  NR-SACKs  also 
report non-renegable out-of-order data chunks in the NR Gap Ack Blocks, a.k.a. “nr-
gap-acks”. Each NR Gap Ack Block acknowledges a continuous subsequence of non-
renegable out-of-order data chunks. All data chunks with TSNs ≥ (Cumulative TSN
Ack + NR Gap Ack Block Start) and ≤ (Cumulative TSN Ack + NR Gap Ack Block
End) of each NR Gap Ack Block are reported as non-renegable. The Number of NR
Gap Ack Blocks (M) field indicates the number of NR Gap Ack Blocks included in the
NR-SACK chunk.

Note that each sequence of TSNs in an NR Gap Ack Block will be a
subsequence  of  one  of  the  Gap  Ack  Blocks,  and  there  can  be  more  than  one  NR  Gap 
Ack  Block  per  Gap  Ack  Block.  Also,  non-renegable  information  cannot  be  revoked.  If 
a  TSN  is  nr-gap-acked  in  any  NR-SACK  chunk,  then  all  subsequent  NR-SACKs  gap- 
acking  that  TSN  should  also  nr-gap-ack  that  TSN.  Complete  details  of  NR-SACK 
chunk  can  be  found  in  [Natarajan  2008a]. 
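
The two constraints just stated (every nr-gap-acked TSN is also gap-acked, and non-renegable information is never revoked) can be expressed as simple set checks. This is an illustrative sketch only: the function names are hypothetical, blocks are modelled as (start, end) offset pairs with inclusive ends, TSN wraparound is ignored, and the chunk is assumed to list its NR Gap Ack Blocks explicitly.

```python
def expand(cum_ack, blocks):
    """Absolute TSNs covered by (start, end) offsets relative to cum_ack."""
    return {cum_ack + off for s, e in blocks for off in range(s, e + 1)}


def nr_blocks_consistent(cum_ack, gap_blocks, nr_gap_blocks):
    """True if every nr-gap-acked TSN is also gap-acked in the same chunk."""
    return expand(cum_ack, nr_gap_blocks) <= expand(cum_ack, gap_blocks)


def revokes_nr_info(prev_nr_tsns, cum_ack, gap_blocks, nr_gap_blocks):
    """True if a TSN nr-gap-acked earlier is now gap-acked but not nr-gap-acked."""
    gap = expand(cum_ack, gap_blocks)
    nr = expand(cum_ack, nr_gap_blocks)
    return any(t in gap and t not in nr for t in prev_nr_tsns)


# Two NR Gap Ack Blocks (12-13 and 15) inside one Gap Ack Block (12-15).
print(nr_blocks_consistent(10, [(2, 5)], [(2, 3), (5, 5)]))  # True
# TSN 12 was nr-gap-acked before but is only gap-acked now: a violation.
print(revokes_nr_info({12}, 10, [(2, 4)], [(3, 4)]))          # True
```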

Figure 3.3: NR-SACK Chunk for SCTP


The  second  least  significant  bit  in  the  Chunk  Flags  field  is  the  (A)ll  bit.  If 
the ‘A’ bit is set to ‘1’, all out-of-order data blocks acknowledged in the NR-SACK
chunk  are  non-renegable.  The  ‘A’  bit  enables  optimized  sender/receiver  processing  and 
reduces  the  size  of  NR-SACK  chunks  when  all  out-of-order  TPDUs  at  the  receiver  are 
non-renegable. 
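
A sender-side check of the ‘A’ bit might look like the sketch below. The mask value follows from the bit position given above; everything else is an assumption made for illustration, in particular that a chunk with the ‘A’ bit set may omit its NR Gap Ack Blocks because the Gap Ack Blocks alone then identify the non-renegable TSNs. [Natarajan 2008a] remains the authoritative description.

```python
A_BIT_MASK = 0x02  # second least significant bit of the 8-bit Chunk Flags field


def all_non_renegable(chunk_flags):
    """True if the (A)ll bit is set in the NR-SACK Chunk Flags."""
    return bool(chunk_flags & A_BIT_MASK)


def non_renegable_tsns(chunk_flags, cum_ack, gap_blocks, nr_gap_blocks):
    """Absolute TSNs the sender may drop from its RtxQ (beyond the cum-ack).

    Assumption: with the A bit set, every gap-acked TSN is non-renegable,
    so the gap-ack blocks are used directly.
    """
    blocks = gap_blocks if all_non_renegable(chunk_flags) else nr_gap_blocks
    return sorted(cum_ack + off for s, e in blocks for off in range(s, e + 1))


print(non_renegable_tsns(0x02, 10, [(2, 4)], []))        # [12, 13, 14]
print(non_renegable_tsns(0x00, 10, [(2, 4)], [(3, 3)]))  # [13]
```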

3.3.2 Unordered Data Transfer using NR-SACKs

NR-SACKs provide an SCTP receiver with the option to convey non-
renegable information on out-of-order data. When a receiver guarantees not to renege
on an out-of-order data chunk and nr-gap-acks the chunk, the sender no longer needs to
keep  that  particular  data  chunk  in  its  RtxQ,  thus  allowing  the  sender  to  free  up  kernel 
memory  sooner  than  if  the  data  chunk  were  only  gap-acked. 

Figure 3.4: Unordered SCTP Data Transfer using NR-SACKs


Figure 3.4 is analogous to Figure 3.2’s example, this time using NR-
SACKs. The sender and receiver are assumed to have negotiated the use of NR-
SACKs during association establishment. As in the example of Figure 3.2, TSNs 11-18
are initially transmitted, and TSN 11 is presumed lost. For each TSN arriving out-of-
order, the SCTP receiver transmits an NR-SACK chunk instead of a SACK chunk. Since
all out-of-order data are non-renegable in this example, every NR-SACK chunk has the
‘A’  bit  set,  and  the  nr-gap-acks  report  the  list  of  TSNs  that  are  both  out-of-order  and 
non-renegable. 

All  NR-SACKs  in  Figure  3.4  have  the  following  format:  (N)R-SACK: 
CumAckTSN;  NRGapAckStart-NRGapAckEnd.  The  first  NR-SACK  (N:10;2-2)  is 
also  a  dupack.  This  NR-SACK  cum-acks  TSN  10,  and  (nr-)gap-acks  TSN  12.  Once  the 
data  sender  is  informed  that  TSN  12  is  non-renegable,  the  sender  frees  up  the  kernel 
memory  allocated  to  TSN  12,  allowing  the  application  to  write  more  data  into  the 
newly  available  send  buffer  space.  Since  TSN  12  is  also  gap-acked,  the  amount  of 
outstanding  data  (O)  is  reduced  to  7,  allowing  the  sender  to  transmit  new  data  -  TSN 
19. 

On  receipt  of  the  second  and  third  dupacks  that  newly  (nr-)gap-ack  TSNs 
13  and  14,  the  sender  removes  these  TSNs  from  the  RtxQ.  On  receiving  the  second 
dupack,  the  sender  transmits  new  data  -  TSN  20.  On  receipt  of  the  third  dupack,  the 
sender  halves  the  cwnd  (C=4),  fast  retransmits  TSN  11,  and  enters  fast  recovery. 
Dupacks  received  during  fast  recovery  (nr-)gap-ack  TSNs  15-20.  The  sender  frees 
RtxQ  accordingly,  and  transmits  new  TSNs  21,  22  and  23.  The  sender  exits  fast 
recovery  when  the  NR-SACK  with  new  cum-ack  (N:20)  arrives.  This  new  cum-ack 
increments C to 5, and decrements O to 3. The sender now transmits new TSNs 24 and 25.

The  explicit  non-renegable  information  in  NR-SACKs  ensures  that  the 
RtxQ  contains  only  necessary  data  -  TPDUs  that  are  actually  in  flight  or  “received  and 
renegable”. Comparing Figures 3.2 and 3.4, we observe that NR-SACKs use the RtxQ
more  efficiently. 
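
The start of the Figure 3.4 timeline can be replayed with a small sketch. It is intentionally limited to the first two dupacks, before the cwnd is halved, and assumes (as in the example) that all out-of-order data is non-renegable, so every TSN left in the RtxQ is still outstanding. The function name and parameters are hypothetical.

```python
def replay_nr_sacks(send_buffer_size, cwnd, first_tsn, nr_gap_acked_in_order):
    """Replay NR-SACKs that each nr-gap-ack one TSN; cwnd is held constant.

    An nr-gap-acked TSN is removed from the RtxQ at once, freeing send
    buffer space, so the sender can transmit new TSNs while the RtxQ
    (equal to outstanding data here) is below both cwnd and buffer size.
    Returns a list of (nr_gap_acked_tsn, new_tsns_sent) pairs.
    """
    rtxq = set(range(first_tsn, first_tsn + send_buffer_size))
    next_tsn = first_tsn + send_buffer_size
    log = []
    for tsn in nr_gap_acked_in_order:
        rtxq.discard(tsn)
        sent = []
        while len(rtxq) < min(cwnd, send_buffer_size):
            rtxq.add(next_tsn)
            sent.append(next_tsn)
            next_tsn += 1
        log.append((tsn, sent))
    return log


# TSNs 11-18 sent with C=8 and an 8-TPDU send buffer; TSNs 12 and 13 nr-gap-acked.
print(replay_nr_sacks(8, 8, 11, [12, 13]))  # [(12, [19]), (13, [20])]
```

The output matches the timeline: the first dupack lets the sender transmit TSN 19 and the second lets it transmit TSN 20, whereas with plain SACKs the full send buffer would have blocked both.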

3.4  Evaluation  Preliminaries 

The  ns-2  SCTP  and  CMT  modules  [NS-2,  Ekiz  2007]  were  extended  to 
support  and  process  NR-SACK  chunks.  The  simulation-based  evaluations  compare 
long-lived  SCTP  or  CMT  flows  using  SACKs  vs.  NR-SACKs  under  varying  cross- 
traffic  loads.  This  section  discusses  the  experiment  setup  and  other  evaluation 
preliminaries  in  detail. 

3.4.1  Simulation  Setup 

[Andrew  2008]  recommends  specific  simulation  setups  and  parameters  for 
realistic  evaluations  of  TCP  extensions  and  congestion  control  algorithms.  These 
recommendations  include  network  topologies,  details  of  cross-traffic  generation,  and 
delay  distributions  mimicking  patterns  observed  in  the  Internet.  We  adhere  to  these 
recommendations  for  a  realistic  evaluation  of  SACKs  vs.  NR-SACKs. 

The  SCTP  evaluations  use  the  dumb-bell  topology  shown  in  Figure  3.5, 
which  models  the  access  link  scenario  specified  in  [Andrew  2008].  The  central 
bottleneck link connects routers R1 (left) and R2 (right), has a 100Mbps capacity, and
2ms  one-way  propagation  delay.  Both  routers  employ  drop  tail  queuing  and  the  queue 
size  is  set  to  the  bandwidth-delay  product  of  a  100ms  flow.  Each  router  is  connected  to 
three  cross-traffic  generating  edge  nodes  via  100Mbps  edge  links  with  the  following 
propagation  delays:  0ms,  12ms,  25ms  (left)  and  2ms,  37ms,  75ms  (right).  Each  left 
edge  node  generates  cross-traffic  destined  to  every  right  edge  node  and  vice-versa. 
Thus, without considering queuing delays, the RTTs for cross-traffic flows sharing the
bottleneck link range from 8ms to 204ms.
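
The RTT range and the queue size follow directly from the link parameters above. The short Python sketch below redoes that arithmetic; it is only an illustration, not part of the ns-2 setup. It assumes that the "100ms flow" used to size the queue refers to a 100ms round-trip time, and all constant and function names are hypothetical.

```python
from itertools import product

# Values quoted in the text above.
BOTTLENECK_DELAY_MS = 2               # one-way propagation delay, R1-R2
LEFT_EDGE_DELAYS_MS = (0, 12, 25)     # one-way delays of the left edge links
RIGHT_EDGE_DELAYS_MS = (2, 37, 75)    # one-way delays of the right edge links
BOTTLENECK_BPS = 100_000_000          # 100Mbps bottleneck capacity
QUEUE_FLOW_RTT_MS = 100               # assumption: "100ms flow" means a 100ms RTT


def cross_traffic_rtts_ms():
    """Propagation-only RTT (ms) for every left/right edge node pair."""
    return sorted(
        2 * (left + BOTTLENECK_DELAY_MS + right)
        for left, right in product(LEFT_EDGE_DELAYS_MS, RIGHT_EDGE_DELAYS_MS)
    )


def queue_size_bytes():
    """Bandwidth-delay product of the bottleneck for the assumed 100ms RTT."""
    return BOTTLENECK_BPS * QUEUE_FLOW_RTT_MS // (8 * 1000)


rtts = cross_traffic_rtts_ms()
print(rtts[0], rtts[-1])      # smallest and largest RTT: 8 204
print(queue_size_bytes())     # 1250000 bytes
```

The shortest path (0ms + 2ms + 2ms one way) gives the 8ms lower bound, and the longest (25ms + 2ms + 75ms one way) gives the 204ms upper bound.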

[Andrew  2008]  recommends  application  level  cross-traffic  generation  over 
packet  level  generation,  since,  in  the  latter  scenario,  cross-traffic  flows  do  not  respond 
to the user/application/transport behavior of competing flows. Also, [Andrew 2008]
proposes the use of the Tmix [Weigle 2006] traffic generator. However, the recommended
Tmix  connection  vectors  were  unavailable  at  the  time  of  performing  our  evaluations. 
Therefore,  we  decided  to  employ  existing  ns-2  application  level  traffic  generation  tools, 
recommended  by  [Wang  2007a,  Wang  2007b].  Since  our  simulation  setup  uses 
application  level  cross-traffic,  we  believe  that  the  general  conclusions  from  our 
evaluations  will  hold  for  evaluations  using  the  Tmix  traffic  generator. 


Figure  3.5:  Topology  for  SCTP  Experiments  (Topology  1) 

Cross-traffic generated by three kinds of applications is considered: (i)
non-greedy, responsive HTTP sessions generated by the PackMime implementation [Cao
2004], (ii) rate controlled, unresponsive video sessions over UDP, and (iii) greedy,
responsive bulk file transfer sessions over TCP. We are unaware of existing
measurement studies on the proportion of each kind of traffic observed in the Internet.
Therefore,  the  simulations  assume  a  simple,  yet  reasonable  rule  for  the  traffic  mix 
proportion  -  more  HTTP  traffic  than  video  or  FTP  traffic. 

Each  edge  node  runs  a  PackMime  session  to  every  edge  node  on  the  other 
side,  and  the  amount  of  generated  HTTP  traffic  is  controlled  via  the  PackMime  rate 
parameter.  Similarly,  each  edge  node  establishes  video  and  FTP  sessions  to  every  edge 
node  on  the  other  side,  and  the  number  of  video/FTP  sources  on  each  node  impacts  the 
amount  of  video/FTP  traffic.  To  avoid  synchronization  issues,  the  PackMime,  video, 
and  FTP  sessions  start  at  randomly  chosen  times  during  the  initial  5  seconds  of  the 
simulation.  The  default  segment  size  for  all  TCP  traffic  results  in  1500  byte  IP  PDUs; 
the  segment  size  for  10%  of  the  FTP  flows  is  modified  to  result  in  576  byte  IP  PDUs. 
Also,  the  PackMime  request  and  response  size  distributions  are  seeded  in  every 
simulation  run,  resulting  in  a  range  of  packet  sizes  at  the  bottleneck  [Andrew  2008]. 

The bottleneck router load is measured as (L) = (mean queue length ÷ total
queue size). Four packet-level load/congestion variations are considered: (i) Low
(~15% load, <0.1% loss), (ii) Mild (~45% load, 1-2% loss), (iii) Medium (~60% load,
3-4% loss), (iv) Heavy (~85% load, 8-9% loss).

Topology  1  (Figure  3.5)  is  used  to  evaluate  SCTP  flows.  CMT  evaluations 
are  over  the  dual-dumbbell  topology  shown  in  Figure  3.6  (topology  2).  Topology  2 
consists  of  two  independent  bottleneck  links  between  routers  R1-R2  and  R3-R4.  Similar 
to  topology  1,  each  router  in  topology  2  is  attached  to  3  cross-traffic  generating  edge 
nodes,  with  similar  bottleneck  and  edge  link  bandwidth/delay  characteristics.  In  both 
topologies,  nodes  S  and  R  are  the  SCTP  or  CMT  sender  and  receiver,  respectively.  In 
topology 2, both S and R are multihomed, and the CMT sender uses the two
independent paths (paths 1 and 2) for simultaneous data transfer. In both topologies, S
and R are connected to the bottleneck routers via 100Mbps duplex edge links, with
14ms one-way delay. Thus, the one-way propagation delay experienced by the SCTP
or the CMT flow corresponds to 30ms (14ms + 2ms + 14ms), approximating the US
coast-to-coast propagation delay [Shakkottai 2004].


Figure  3.6:  Topology  for  CMT  Experiments  (Topology  2) 

In both topologies, the bottleneck links experience bi-directional cross-traffic;
the cross-traffic load is similar in both forward and reverse directions. In
topology 1, the cross-traffic load varies from low to heavy. For CMT evaluations using
topology 2, the bottlenecks experience asymmetric path loads; path 1 cross-traffic load
varies from low to heavy, while path 2 experiences low load.

The SCTP or CMT flow initiates an unordered data transfer ~18-20
seconds after the simulation begins. Because the transfer is unordered, all data
received out-of-order at R is deliverable, and thus, non-renegable. Trace collection
begins after a 20 second warm-up period from the start of SCTP or CMT traffic, and
ends when the simulation completes after 70 seconds. The CMT sender uses the
recommended RTX-SSTHRESH retransmission policy, i.e., retransmissions are sent on
the path with the highest ssthresh [Iyengar 2006].

3.4.2  Metric:  Efficient  Retransmission  Queue  Utilization 

In  transfers  using  SACKs,  the  RtxQ  consists  of  two  kinds  of  data  (Figure 
3.2):  (i)  necessary  data  -  data  that  is  either  “in  flight”  and  has  not  yet  reached 
the receiver’s transport layer, or data that has been received but is renegable by the
transport  receiver,  and  (ii)  unnecessary  data  -  data  that  is  received  out  of  order  and  is 
non-renegable.  The  RtxQ  is  most  efficiently  utilized  when  all  data  in  the  RtxQ  are 
necessary.  As  the  fraction  of  unnecessary  data  increases,  the  RtxQ  is  less  efficiently 
utilized. 

The transport sender modifies the RtxQ as and when SACKs or NR-SACKs
arrive. The RtxQ size varies during the course of a file transfer, but can never
exceed the send buffer size. For time duration $t_i$ in the transfer, let
$r_i$ = size of the retransmission queue, and
$k_i$ = amount of necessary data in the RtxQ.

During $t_i$, only $k_i \div r_i$ of the RtxQ is efficiently utilized, and the efficiency
changes whenever $k_i$ or $r_i$ changes. Let $k_i / r_i$ be the efficient RtxQ utilization
value during time duration $t_i$. To compute the time weighted efficient
RtxQ utilization, the ns-2 SCTP (or CMT) sender tracks $k_i$, $r_i$, and $t_i$ until association
shutdown. Let,

W  =  time  when  trace  collection  begins  after  the  initial  warm-up  time,  and 

E  =  simulation  end  time. 

In  the  following  discussions,  the  time  weighted  efficient  RtxQ  utilization 
averaged  over  the  entire  trace  collection  time,  i.e.,  T  =  (E  -  W),  is  referred  to  as 
RtxQ_Util. 
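
To make the metric concrete, the following Python sketch computes a time weighted average of the per-interval utilization k_i ÷ r_i from a list of (t_i, r_i, k_i) samples. It is a minimal illustration and not the ns-2 code: the formula Σ (k_i ÷ r_i) × t_i ÷ T is assumed by analogy with the other time weighted averages in this chapter, the function name is hypothetical, and treating an empty RtxQ as fully efficient is our own convention.

```python
def rtxq_util(samples):
    """Time weighted efficient RtxQ utilization.

    samples: iterable of (t_i, r_i, k_i) tuples, where t_i is the length of
    an interval inside the trace collection window, r_i the RtxQ size during
    that interval, and k_i the amount of necessary data in the RtxQ.
    """
    total_time = 0.0
    weighted_sum = 0.0
    for t_i, r_i, k_i in samples:
        if t_i <= 0:
            continue  # ignore empty intervals
        # Assumption: an empty RtxQ wastes nothing, so count it as efficient.
        ratio = 1.0 if r_i == 0 else k_i / r_i
        weighted_sum += ratio * t_i
        total_time += t_i
    if total_time == 0:
        raise ValueError("no samples with positive duration")
    return weighted_sum / total_time


# Hypothetical trace: fully efficient for 2s, half wasted for 1s, efficient for 1s.
samples = [(2.0, 100, 100), (1.0, 100, 50), (1.0, 80, 80)]
print(rtxq_util(samples))  # 0.875
```

In this made-up trace the RtxQ is only half useful for one of the four seconds, so the time weighted utilization drops from 1 to 0.875.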

In an unordered transfer using NR-SACKs, all out-of-order data will be
nr-gap-acked and the RtxQ should contain only necessary data. Therefore, we expect an
SCTP or CMT flow using NR-SACKs to most efficiently utilize the RtxQ (RtxQ_Util
= 1) under all circumstances.


3.4.3  Retransmission  Queue  Utilization  during  Loss  Recovery 

Typically, in SCTP transfers, data is received in-order when there are no
losses, unless the intermediate routers reorder packets. Consequently, in the absence of losses,
SCTP  flows  employing  either  SACKs  or  NR-SACKs  utilize  the  RtxQ  most  efficiently, 
and  the  corresponding  RtxQ_Util  values  equal  unity.  The  two  acknowledgment 
mechanisms  differ  in  RtxQ  usage  only  when  data  is  received  out-of-order,  which  ensues 
when  an  SCTP  flow  suffers  packet  losses.  Specifically,  in  SCTP,  the  duration  of  NR- 
SACKs’  impact  on  the  RtxQ  is  limited  to  loss  recovery  periods.  To  evaluate  the  impact 
of  the  two  ack  schemes  during  loss  recovery  periods,  the  ns-2  SCTP  sender  timestamps 
every entry/exit to/from loss recovery. Since none of the routers reorder packets in our
simulations, the SCTP sender uses the following naive rule: the sender enters loss
recovery on the receipt of a SACK (or NR-SACK) with at least one gap-ack block,
and exits loss recovery on the receipt of a SACK (or NR-SACK) with a new cum-ack
and zero gap-acks. We found that this simple rule resulted in a good approximation of
the actual loss recovery periods.
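
The naive entry/exit rule can be sketched as a small state tracker. The Python code below is an illustrative assumption about how such timestamping might look and is not the ns-2 implementation; the class and method names are hypothetical, and TSN wraparound is ignored for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class LossRecoveryTracker:
    """Timestamps loss recovery periods using the naive gap-ack rule."""
    cum_ack: int = 0
    in_recovery: bool = False
    entered_at: float = 0.0
    periods: list = field(default_factory=list)  # (entry_time, exit_time) pairs

    def on_sack(self, now, cum_ack, num_gap_blocks):
        """Process one SACK (or NR-SACK) received at time `now`."""
        new_cum_ack = cum_ack > self.cum_ack  # simplification: no TSN wraparound
        if new_cum_ack:
            self.cum_ack = cum_ack
        if not self.in_recovery and num_gap_blocks >= 1:
            # Entry: the ack reports at least one gap-ack block.
            self.in_recovery = True
            self.entered_at = now
        elif self.in_recovery and new_cum_ack and num_gap_blocks == 0:
            # Exit: a new cum-ack with zero gap-acks.
            self.in_recovery = False
            self.periods.append((self.entered_at, now))

    def total_recovery_time(self):
        return sum(end - start for start, end in self.periods)


tracker = LossRecoveryTracker(cum_ack=10)
tracker.on_sack(1.0, 12, 0)   # in-order ack, no recovery
tracker.on_sack(1.2, 12, 1)   # first gap-ack block: enter loss recovery
tracker.on_sack(1.5, 15, 1)   # cum-ack advances but a gap remains: stay
tracker.on_sack(1.8, 20, 0)   # new cum-ack, zero gap-acks: exit
print(tracker.periods)        # [(1.2, 1.8)]
```

The recorded periods are the durations over which the utilization during loss recovery would be averaged.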

Let $k_i / r_i$ be the efficient RtxQ utilization values during the loss
recovery periods $l_i$ ($\sum_i l_i = L$), respectively. The time weighted efficient RtxQ
utilization averaged over only the loss recovery durations of trace collection (L) is
referred to as RtxQ_Util_L, and is calculated as

$$RtxQ\_Util\_L = \left( \sum_i \frac{k_i}{r_i} \times l_i \right) \div L.$$


An  SCTP  sender  tracked  both  RtxQ_Util  and  RtxQ_Util_L.  Depending  on 
the  paths’  bandwidth/delay  characteristics,  a  CMT  association  experiences  data 
reordering  even  under  no  loss  conditions.  Data  transmitted  on  the  shorter  delay  path 
will  be  received  out-of-order  w.r.t.  data  transmitted  on  other  path(s).  Therefore,  the 
naive  rule  mentioned  above  cannot  be  employed  to  estimate  entry/exit  of  CMT  sender’s 
loss recovery, and the CMT sender tracked only RtxQ_Util.


3.5  Results 

For  each  type  of  sender  (SCTP  or  CMT),  different  send  buffer  sizes 
imposing  varying  levels  of  memory  constraints  are  considered:  32K,  64K  and  INF 
(unconstrained  space)  for  SCTP,  and  128K,  256K  and  INF  for  CMT.  The  results 
presented  here  are  averaged  over  30  runs,  and  plotted  with  95%  confidence  intervals. 
In  the  following  discussions,  an  SCTP  flow  using  SACKs  or  NR-SACKs  is  referred  to 
as SCTP-SACKs and SCTP-NR-SACKs, respectively. Similarly, a CMT flow using
SACKs or NR-SACKs is referred to as CMT-SACKs and CMT-NR-SACKs.


3.5.1  Retransmission  Queue  Utilization 

As the end-to-end path gets more congested, SCTP-SACKs’ RtxQ_Util_L
remains fairly consistent at ~0.5 (Figure 3.7), while the RtxQ_Util decreases (Figure 3.8).

Figure 3.7: RtxQ Utilization during Loss Recovery in SCTP

The RtxQ_Util_L values indicate that irrespective of path loss rate, SCTP-SACKs
efficiently utilize only ~50% of the RtxQ during loss recovery; ~50% of the RtxQ is
wasted buffering unnecessary data. At lower congestion levels (lower cross-traffic), the
frequency of loss events and the fraction of transfer time spent in loss recovery are
smaller, resulting in negligible RtxQ wastage during the entire trace collection period
(RtxQ_Util). As loss recoveries become more frequent, SCTP-SACKs’ inefficient
RtxQ utilization during loss recovery lowers the corresponding RtxQ_Util values. The
simulation results show that, on average, SCTP-SACKs waste ~20% of the RtxQ
during moderate congestion and ~30% during heavy congestion conditions. The
amount of wasted kernel memory increases as the number of transport connections
increases, and can be significant at a server handling large numbers of concurrent
connections, such as a web server.

Figure 3.8: RtxQ Utilization in SCTP

Figure 3.9: RtxQ Utilization in CMT

By definition of the RtxQ_Util metric, NR-SACKs are expected to utilize
the RtxQ most efficiently, even during loss recovery periods (Section 3.4.2). The
simulation results confirm this hypothesis. Under all traffic loads, RtxQ_Util values for
both SCTP-NR-SACKs and CMT-NR-SACKs (Figure 3.9) are unity.

In CMT evaluations, path 2 experiences low traffic load, while path 1’s
traffic load varies from low to heavy (Figure 3.6). Recall that a CMT sender transmits
data concurrently on both paths. Asymmetric path congestion levels aggravate data
reordering in CMT. As path 1 congestion level increases, TPDU losses on the more
congested path 1 cause data transmitted on the less congested path 2 to arrive
out-of-order at the receiver. CMT congestion control is designed such that losses on path 1
do not affect the cwnd/flightsize on path 2 [Iyengar 2006]. While losses on path 1 are
being recovered, the sender continues data transmission on path 2, increasing the amount
of non-renegable out-of-order data in the RtxQ. As the paths become increasingly
asymmetric  in  their  congestion  levels,  the  amount  of  non-renegable  out-of-order  data  in 
the  RtxQ  increases,  and  brings  down  CMT-SACKs’  RtxQ_Util  (Figure  3.9). 

Increasing  the  send  buffer/RtxQ  space  improves  SCTP-SACKs’  or  CMT- 
SACKs’  kernel  memory  (RtxQ)  utilization  only  to  a  certain  degree.  In  Figures  3.8  and 
3.9, RtxQ_Util for the INF send buffer is essentially the upper bound on how efficiently
SCTP or CMT employing SACKs utilizes the RtxQ. Therefore, we conclude that
TPDU  reordering  results  in  inevitable  RtxQ  wastage  in  transfers  using  SACKs.  The 
amount  of  wasted  memory  increases  as  TPDU  reordering  and  loss  recovery  durations 
increase.  Also,  smaller  send  buffer  sizes  further  degrade  RtxQ_Util_L  and  RtxQ_Util 
values  in  transfers  using  SACKs.  This  degradation  is  more  pronounced  in  CMT  (Figure 
3.9).  Further  investigations  reveal  this  effect  to  be  due  to  send  buffer  blocking, 
discussed  next. 

3.5.2 Send Buffer Blocking in CMT

When  the  RtxQ  grows  to  fill  the  entire  send  buffer,  send  buffer  blocking 
ensues,  preventing  the  application  from  writing  new  data  into  the  transport  layer 
(Section  3.2.2).  In  both  SCTP  and  CMT,  send  buffer  blocking  increases  as  the  send 
buffer  is  more  constrained  (decreases).  In  addition,  CMT  employs  multiple  paths  for 
data transfer, increasing a sender’s total flightsize in comparison to SCTP. Therefore,
we  hypothesized  that  CMT  would  suffer  more  send  buffer  blocking  than  SCTP 
(Section  3.2.3).  Indeed,  in  the  simulations,  CMT  suffered  significant  send  buffer 
blocking  even  for  128K  and  256K  send  buffer  sizes.  In  this  section,  we  focus  on  the 
effects  of  send  buffer  blocking  in  CMT. 

CMT  using  either  acknowledgment  scheme  suffers  from  send  buffer 
blocking  for  128K  and  256K  buffer  sizes.  In  CMT-SACKs,  send  buffer  blocking 
continues until the cum-ack point moves forward, i.e., until loss recovery ends. As path 1
congestion  level  increases,  timeout  recoveries  become  more  frequent,  causing  longer 
loss  recovery  durations.  Therefore,  as  congestion  increases,  the  CMT-SACKs  sender  is 
blocked  for  longer  periods  of  transfer  time.  On  the  other  hand,  send  buffer  blocking  in 
CMT-NR-SACKs  is  unaffected  by  the  congestion  level  on  path  1.  As  and  when  NR- 
SACKs  arrive  (on  path  2),  the  CMT-NR-SACK  sender  removes  nr-gap-acked  data 
from  the  RtxQ,  allowing  more  data  transmission.  CMT-SACKs’  longer  send  buffer 
blocking  durations  adversely  impact  performance  as  discussed  below. 
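
The difference between the two ack schemes can be illustrated with a toy model of the sender's RtxQ. The Python sketch below is a deliberate simplification and not the simulated sender: it counts TPDUs rather than bytes, assumes an unordered transfer so that every gap-acked TSN is non-renegable, and uses hypothetical function names.

```python
def rtxq_after_ack(rtxq, cum_ack, gap_acked, use_nr_sack):
    """Return the TSNs still held in the RtxQ after processing one ack.

    With SACKs only cum-acked TSNs are released; gap-acked TSNs stay.
    With NR-SACKs the nr-gap-acked TSNs are released as well.
    """
    remaining = {tsn for tsn in rtxq if tsn > cum_ack}
    if use_nr_sack:
        remaining -= set(gap_acked)
    return remaining


def is_blocked(rtxq, send_buffer_tpdus):
    """Send buffer blocking: the RtxQ fills the whole send buffer."""
    return len(rtxq) >= send_buffer_tpdus


SEND_BUFFER_TPDUS = 8               # hypothetical buffer size in TPDUs
outstanding = set(range(1, 9))      # TSNs 1..8 fill the send buffer
received_out_of_order = range(2, 9) # TSN 1 is lost, TSNs 2..8 arrive

with_sacks = rtxq_after_ack(outstanding, 0, received_out_of_order, False)
with_nr_sacks = rtxq_after_ack(outstanding, 0, received_out_of_order, True)
print(is_blocked(with_sacks, SEND_BUFFER_TPDUS))     # True
print(is_blocked(with_nr_sacks, SEND_BUFFER_TPDUS))  # False
```

In this example the SACK sender stays blocked until TSN 1 is recovered and the cum-ack advances, whereas the NR-SACK sender holds only the lost TSN and can keep transmitting new data.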

3.5.2.1 Ineffective Use of Send Buffer Space

Send  buffer  blocking  limits  RtxQ  growth  and  reduces  throughput.  The 
impact  on  throughput  is  minimized  when  the  available  send  buffer  space  is  utilized  as 
much  as  possible. 

Figures 3.10 and 3.11 illustrate the CMT sender’s RtxQ evolution over 40
seconds of a transfer using SACKs and NR-SACKs, respectively. The figures show
that both CMT-SACKs and CMT-NR-SACKs suffer from send buffer blocking - the
maximum RtxQ size in the figures corresponds to 100% of the send buffer (128K).
However, the RtxQ evolution in CMT-SACKs (Figure 3.10) exhibits more variance: it
reaches the maximum and drops to 0 multiple times, while CMT-NR-SACKs’ RtxQ
size is closer to 128K most of the time (Figure 3.11).


Figure 3.12: RtxQ Evolution in CMT-SACKs (~1.5 sec)

Figure  3.12  is  a  zoom  of  CMT-SACKs’  RtxQ  evolution  over  an  arbitrary 
1.5  second  period.  At  point  A  (time  66.36sec),  RtxQ  size  hits  the  maximum,  and  the 
sender  is  blocked  from  transmitting  any  more  data.  Subsequent  SACKs  reduce  the 
amount  of  outstanding  data,  but  send  buffer  blocking  prevents  the  sender  from  clocking 
out new data. At time 66.42sec, path 1’s retransmission timer expires; the sender
detects loss, and retransmits TSN 20369 on path 2. At time 66.48sec (point B), the sender
receives a SACK with a new cum-ack (TSN=20457) and completely clears the RtxQ
contents, ending the current instance of send buffer blocking. The sender immediately
transmits new data on both paths, and the RtxQ evolution after the new cum-ack
(TSN=20457) is shown by the (green) dashed line. The cwnd on path 1 allows
transmission of 2 MTU sized TPDUs (TSNs 20458 and 20459). The cwnd on path 2 is
127162 bytes, but the Max.Burst parameter [RFC4960] limits the sender to transmitting
only 4 MTU sized TPDUs - TSNs 20460-20463. Once the sender transmits data on
both paths, the RtxQ size increases to ~8.6K, shown by point C. Subsequent SACKs allow
more  data  transmission  and  at  point  D  the  sender’s  RtxQ  reaches  the  maximum  causing 
the  next  instance  of  send  buffer  blocking. 
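
The burst limit at point B can be reproduced with a few lines of Python. The sketch follows the cwnd adjustment that RFC 4960 (Section 6.1) permits for applying Max.Burst, with the default Max.Burst of 4. The 1500 byte MTU and the 3000 byte cwnd for path 1 are illustrative assumptions chosen to match the two TPDUs mentioned above, and the function name is hypothetical.

```python
def sendable_tpdus(cwnd_bytes, flightsize_bytes, mtu_bytes=1500, max_burst=4):
    """Number of MTU sized TPDUs a sender may clock out in one burst.

    Applies the optional RFC 4960 adjustment:
        if flightsize + Max.Burst * MTU < cwnd:
            cwnd = flightsize + Max.Burst * MTU
    """
    burst_limited_cwnd = min(cwnd_bytes, flightsize_bytes + max_burst * mtu_bytes)
    room = burst_limited_cwnd - flightsize_bytes
    return max(room, 0) // mtu_bytes


# The new cum-ack clears the RtxQ, so nothing is outstanding on either path.
path1 = sendable_tpdus(cwnd_bytes=3000, flightsize_bytes=0)     # assumed cwnd
path2 = sendable_tpdus(cwnd_bytes=127162, flightsize_bytes=0)   # cwnd from the text
print(path1, path2)  # 2 4
```

Although path 2's cwnd would admit far more data, only six TPDUs in total leave the sender at point B, which is why the RtxQ restarts from a small size (point C) instead of refilling the send buffer at once.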

Though CMT-NR-SACKs also incur send buffer blocking (Figure 3.11),
nr-gap-acks  free  up  RtxQ  space  allowing  the  sender  to  steadily  clock  out  more  data.  A 
constrained  send  buffer  is  better  utilized,  and  the  transmission  is  less  bursty  with  NR- 
SACKs  than  SACKs.  The  improved  send  buffer  use  contributes  to  throughput 
improvements  (discussed  later). 

3.5.2.2  Efficient  Retransmission  Queue  Utilization 

In  Figure  3.9,  CMT-SACKs’  RtxQ_Util  worsens  as  send  buffer  blocking 
increases  (send  buffer  size  decreases).  As  discussed  earlier,  in  CMT-SACKs,  send 
buffer  blocking  prevents  new  data  transmission  until  loss  recovery  ends.  Lack  of  new 
data  transmission  resulted  in  fewer  and  sometimes  insufficient  acks  to  trigger  fast 
retransmits.  Consequently,  blocked  CMT-SACKs  experienced  more  timeout  recoveries 
(RTOs)  at  heavy  traffic  loads  than  non-blocked  CMT-SACKs  (Figure  3.13).  As  the 
send buffer is more constrained, the average number of RTOs increases, and the fraction
of  transfer  time  spent  in  loss  recovery  increases.  Longer  loss  recovery  durations 
increase  the  duration  of  inefficient  RtxQ  utilization,  and  bring  down  blocked  CMT- 
SACKs’  RtxQ_Util  values  compared  to  non-blocked  (INF)  CMT-SACKs’  RtxQ_Util. 

On  the  other  hand,  CMT-NR-SACKs  steadily  clock  out  data,  and  do  not 
incur  excessive  RTOs  during  send  buffer  blocking.  CMT-NR-SACKs’  mean  number  of 
RTOs  for  128K  and  256K  buffer  sizes  are  similar  to  the  INF  case  (Figure  3.13).  To 
summarize,  send  buffer  blocking  worsens  CMT-SACKs’  RtxQ  utilization.  Blocked 
CMT-SACKs’ inefficient send buffer usage increases the number of timeout
recoveries,  and  degrades  throughput  when  compared  to  CMT-NR-SACKs. 


Figure  3.13:  Mean  Number  of  RTOs  during  Heavy  Cross-traffic  in  CMT 


3.5.2.3  Throughput 

When the send buffer never limits RtxQ growth (INF send buffer size),
neither CMT-SACKs nor CMT-NR-SACKs experience send buffer blocking, and both
perform similarly (Figure 3.14). However, CMT-SACKs achieve the same throughput
as  CMT-NR-SACKs  at  the  cost  of  larger  RtxQ  sizes. 

Using terminology defined in Section 3.4.2, the average RtxQ size, $\overline{RtxQ}$,
over the entire trace collection period (T) is calculated as
$\overline{RtxQ} = \left( \sum_i t_i \times r_i \right) \div T$.

Figure 3.15 plots CMT-SACKs’ vs. CMT-NR-SACKs’ average RtxQ size for the INF case. As path 1
cross-traffic load increases, the bandwidth available for the CMT flow decreases, and
CMT-NR-SACKs’ average RtxQ size decreases (Figure 3.15).

Figure  3.14:  CMT-SACKs  vs.  CMT-NR-SACKs  Throughput 
Figure 3.15: CMT-SACKs vs. CMT-NR-SACKs Average RtxQ Size

Similarly, CMT-SACKs’ average RtxQ size decreases as traffic load increases from low to mild.
However, a different factor dominates and increases CMT-SACKs’ average RtxQ size during
medium and heavy traffic conditions. Note that RtxQ growth is never constrained in the
INF case, enabling the CMT sender to transmit as much data as possible on path 2
while recovering from losses on path 1. At medium and heavy cross-traffic loads, loss
recovery durations increase due to increased timeout recoveries, and the CMT-SACKs
sender transmits more data on path 2 compared to mild traffic conditions. This factor
increases CMT-SACKs’ average RtxQ size during medium and heavy traffic conditions.

Going back to Figure 3.14, when the send buffer size limits RtxQ growth,
CMT-NR-SACKs’ efficient RtxQ utilization enables CMT-NR-SACKs to perform
better than CMT-SACKs. The throughput improvements in CMT-NR-SACKs increase
as the conditions that aggravate send buffer blocking intensify, i.e., NR-SACKs improve
throughput more as the send buffer becomes more constrained and/or when the paths
become more asymmetric in their congestion levels. Alternatively, CMT-NR-SACKs
achieve throughput similar to CMT-SACKs using smaller send buffer sizes. For
example, during mild, medium and heavy path 1 cross-traffic load, CMT-NR-SACKs
with a 128K send buffer perform similar to or better than CMT-SACKs with a 256K send
buffer. Also, CMT-NR-SACKs with a 256K send buffer perform similar to CMT-SACKs
with a larger (unconstrained) send buffer.

3.6  Conclusion,  Ongoing  and  Future  Work 

This work investigated the effects of the existing transport layer SACK
mechanism when data received out-of-order is non-renegable. We conclude that
SACKs cause inevitable sender memory wastage, which worsens as data reordering
and loss recovery durations increase. We proposed a new ack mechanism,
Non-Renegable Selective Acknowledgments (NR-SACKs) for SCTP, which provides the
transport receiver with the option to convey non-renegable information on some or all
out-of-order data. The concept of NR-SACKs is applicable to any reliable transport
that employs SACKs and/or provides out-of-order data delivery.

Note that a transfer employing NR-SACKs never performs worse than a
transfer  using  SACKs.  When  out-of-order  data  is  non-renegable,  NR-SACKs  perform 
better  than  SACKs.  Simulations  confirmed  that  in  both  SCTP  and  CMT,  NR-SACKs 
utilize  send  buffer  and  RtxQ  space  most  efficiently.  Send  buffer  blocking  in  CMT  with 
SACKs  adversely  impacts  end-to-end  performance,  while  efficient  send  buffer  use  in 
CMT  with  NR-SACKs  alleviates  send  buffer  blocking.  Therefore,  NR-SACKs  not  only 
reduce the sender’s memory requirements, but also improve throughput in CMT. The only
negatives with NR-SACKs are the added complexity of implementation, and the extra
overhead to generate and process NR-SACKs. We argue these negatives are negligible.

3.6.1  IETF  Internet  Draft 

We  plan  to  standardize  the  design  and  processing  specifics  of  the  SCTP 
NR-SACK  chunk,  and  have  proposed  the  same  as  an  IETF  Internet  Draft  in  the 
transport  area  working  group  (TSVWG)  [Natarajan  2008a].  The  details  of  the  NR- 
SACK chunk and the simulation results were presented at the 71st and 72nd IETF
meetings.  Based  on  the  positive  feedback  from  the  TSVWG  members,  the  proposal  has 
been  modified  to  be  an  experimental  item,  and  is  currently  being  implemented  in  the 
reference  SCTP  implementation  on  FreeBSD.  As  future  work,  we  also  plan  on 
conducting empirical studies to gather information on how often reneging occurs, if
at all, in practice.

3.6.2  NR-SACKs  Implementation  in  FreeBSD 

Ertugrul Yilmaz is heading the ongoing effort to implement NR-SACKs in
the  FreeBSD  SCTP  stack.  This  effort  involves  defining  the  NR-SACK  chunk  structure, 
modifying  the  sender  and  receiver  code  to  generate  and  process  NR-SACKs, 
respectively, and defining a test suite to debug the NR-SACKs implementation. In the
future,  we  plan  to  draw  on  the  FreeBSD  implementation  to  compare  SACKs  vs.  NR- 
SACKs  performance  for  both  SCTP  and  CMT. 

Chapter 4

CMT  PERFORMANCE  DURING  FAILURE 

This chapter discusses the third problem - Concurrent Multipath Transfer
(CMT) performance during path failures. Section 4.1 motivates this research by
discussing the prevalence of link failures in the Internet. Section 4.2 overviews
CMT’s failure detection process, and discusses how CMT’s throughput degrades
during path failures. Section 4.3 details a proposed solution to the problem - CMT
with the “potentially-failed” destination state (CMT-PF). Sections 4.4 and 4.5 present
simulation based evaluations of CMT vs. CMT-PF during failure and congestion,
respectively. Finally, Section 4.6 concludes and presents ongoing, future and related
work.

4.1  Motivation 

As  discussed  in  Chapter  1,  SCTP  natively  supports  transport  layer 
multihoming  for  fault-tolerance  purposes.  Concurrent  Multipath  Transfer  (CMT) 
[Iyengar  2006]  is  an  experimental  SCTP  extension  that  assumes  multiple  independent 
paths  between  multihomed  end  points,  and  exploits  the  independent  paths  for 
simultaneous  transfer  of  new  data  (see  Chapter  1). 

Path  failures  arise  when  a  router  or  a  link  connecting  two  routers  fails  due 
to  planned  maintenance  activities  or  unplanned  accidental  reasons  such  as  hardware 
malfunction  or  software  error.  Ideally,  the  routing  system  detects  unplanned  link 
failures,  and  reconfigures  the  routing  tables  to  avoid  routing  traffic  via  the  failed  link. 


Using  data  from  an  ISP’s  routing  logs,  [Markopoulou  2004]  observes  that  link  failures 
are  part  of  everyday  operation.  Around  80%  of  the  failures  are  unplanned,  and  the 
time-to-repair  for  any  particular  failure  can  be  on  the  order  of  hours.  Existing  research 
also  highlights  problems  with  Internet  backbone  routing  that  result  in  long  route 
convergence times. [Labovitz 2000] shows that the Internet's interdomain routers may take
as  long  as  tens  of  minutes  to  reconstruct  new  paths  after  a  failure.  During  these  delayed 
convergences,  end-to-end  Internet  paths  experience  intermittent  loss  of  connectivity  in 
addition  to  increased  packet  loss,  latency,  and  reordering. 

Using  probes,  [Paxson  1997]  and  [Zhang  2000]  find  that  “significant 
routing  pathologies”  prevent  selected  pairs  of  hosts  from  communicating  about  1.5%  to 
3.3%  of  the  time.  Importantly,  the  authors  also  find  that  this  trend  has  not  improved 
with  time.  Reference  [Labovitz  1999]  examines  routing  table  logs  of  Internet 
backbones  to  find  that  10%  of  all  considered  routes  were  available  less  than  95%  of  the 
time,  and  more  than  65%  of  all  routes  were  available  less  than  99.99%  of  the  time.  The 
durations of these path outages were heavy-tailed, and about 40% of path outages took
more  than  30  minutes  to  repair.  In  [Chandra  2001],  the  authors  use  probes  to  confirm 
that  failure  durations  are  heavy-tailed,  and  report  that  5%  of  detected  failures  last  more 
than  2.75  hours,  and  as  long  as  27.75  hours.  The  pervasiveness  of  path  failures  in 
practice  motivates  us  to  study  their  impact  on  CMT. 

4.2  CMT  Performance  during  Path  Failure 

This  section  gives  an  overview  of  CMT’s  failure  detection  procedure  and 
describes  how  CMT’s  performance  suffers  during  path  failures. 

4.2.1 Failure Detection in CMT


Since CMT is an extension to SCTP, CMT retains SCTP’s failure
detection process. A CMT sender uses a tunable failure detection threshold called
Path.Max.Retrans (PMR) [RFC4960]. As shown in the finite state machine of Figure
4.1, a destination is in one of two states - active or failed (inactive). A destination
is active as long as acks come back for data or heartbeats (probes) sent to that
destination. When a sender experiences more than PMR consecutive timeouts while
trying to reach a specific active destination, that destination is marked as failed. Only
heartbeats (i.e., no data) are sent to a failed destination. A failed destination returns to
the active state when the sender receives a heartbeat ack. RFC4960 proposes a default
PMR value of 5, which translates to at least 63 seconds (6 consecutive timeouts) for
failure detection.
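
The 63 second figure can be checked with a short calculation. The Python sketch below assumes that the retransmission timer starts at the RFC 4960 minimum of 1 second (RTO.Min) and doubles after every timeout up to RTO.Max of 60 seconds, which gives the lower bound quoted above. The small class mirrors the two-state machine described in this section; all names are hypothetical and the code is an illustration, not the SCTP implementation.

```python
def failure_detection_time(pmr=5, rto_start=1.0, rto_max=60.0):
    """Seconds until a destination is marked failed, with exponential backoff.

    A destination fails after more than PMR consecutive timeouts,
    i.e. after PMR + 1 timer expirations.
    """
    rto = rto_start
    elapsed = 0.0
    for _ in range(pmr + 1):
        elapsed += rto
        rto = min(2 * rto, rto_max)  # timer doubles on each timeout
    return elapsed


class Destination:
    """Two-state (active/failed) failure detection for one destination."""

    def __init__(self, pmr=5):
        self.pmr = pmr
        self.consecutive_timeouts = 0
        self.state = "active"

    def on_timeout(self):
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts > self.pmr:
            self.state = "failed"  # only heartbeats are sent from now on

    def on_ack(self):
        # An ack for data or a heartbeat resets the count and reactivates.
        self.consecutive_timeouts = 0
        self.state = "active"


print(failure_detection_time())  # 1 + 2 + 4 + 8 + 16 + 32 = 63.0

dest = Destination()
for _ in range(6):
    dest.on_timeout()
print(dest.state)  # failed
```

If the timer starts above the 1 second minimum, detection takes longer, which is why 63 seconds is only a lower bound.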


Figure 4.1: Failure Detection in CMT


4.2.2  Receive  Buffer  Blocking  in  CMT 

[Iyengar 2005] explores the “rbuf blocking” problem in CMT, where
TPDU losses throttle data transmission once the CMT receiver’s buffer (rbuf) is filled
with out-of-order data. Even though the cwnd would allow new data to be transmitted,
rbuf blocking (i.e., flow control) stalls the sender, causing throughput degradation.

The rbuf blocking problem cannot be eliminated in CMT [Iyengar 2005]. To
reduce rbuf blocking’s negative impact during congestion, [Iyengar 2005] proposes
different retransmission policies that use heuristics for faster loss recovery. These
policies  consider  different  path  properties  such  as  loss  rate  and  delay,  and  try  to  reduce 
rbuf  blocking  by  sending  retransmissions  on  a  path  with  lower  loss  or  delay.  In  practice, 
the  loss  rate  of  a  path  can  only  be  estimated,  so  [Iyengar  2005]  proposed  the 
RTX_SSTHRESH  policy,  where  retransmissions  are  sent  on  the  path  with  the  largest 
slow-start  threshold.  Since  RTX_SSTHRESH  outperformed  other  retransmission 
policies  during  congestion,  [Iyengar  2005]  recommended  the  RTX_SSTHRESH  policy 
for  CMT.  However,  [Iyengar  2005]  did  not  consider  CMT  performance  during  path 
failures.  As  we  shall  show,  CMT  with  the  RTX_SSTHRESH  policy  suffers  from 
significant  rbuf  blocking  during  path  failures. 

4.2.3  Rbuf  Blocking  during  Path  Failure 

CMT’s failure-induced rbuf blocking problem is modeled via the timeline
shown in Figure 4.2. The CMT sender (A) has two interfaces, A1 and A2, and
transmits data to a receiver (B) with two interfaces, B1 and B2. All four addresses are
bound in the CMT association such that the sender employs two independent paths,
path 1 and path 2, for data transmission. Ci and Oi denote the cwnd in number of
MTUs, and the number of outstanding TPDUs, respectively, on path i. The initial cwnd
for each path is 2 MTUs. The data transfer example also assumes the following for easier
illustration: (a) each SCTP PDU contains a single data chunk, resulting in a one-to-one
correspondence between an SCTP PDU and a TSN, and (b) each SCTP PDU is
MTU-sized.

In Figure 4.2, a SACK labeled <Sa, b-c; Rd> acknowledges all TSNs up to
and including the cumulative TSN value of a, the in-order arrival of TSNs b through c
(a missing report for TSNs a+1 through b-1), and an advertised receiver window¹
capable of buffering d more TSNs. On receiving a SACK, sender A subtracts the
number of outstanding TSNs from the advertised receiver window, and calculates the
amount of new data that can be sent without overflowing the receive buffer. The
transport layer receive buffer for this example can hold a maximum of 5 TSNs, and its
contents are listed after the reception of every TSN.
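The sender-side flow control check described above can be sketched as follows. This is a minimal illustration, not the module's actual code: it counts in TSNs rather than bytes, relying on the example's assumption that every PDU carries one MTU-sized chunk, and the function name is hypothetical.

```python
def new_tsns_allowed(a_rwnd, outstanding_per_path):
    """Number of new TSNs that fit in the receive buffer.

    a_rwnd: advertised receiver window from the latest SACK, in TSNs.
    outstanding_per_path: outstanding TSN counts, one entry per path.
    """
    total_outstanding = sum(outstanding_per_path)
    # Flow control never yields a negative allowance.
    return max(0, a_rwnd - total_outstanding)


# a_rwnd of 4 with one outstanding TSN on each path leaves room for 2 TSNs.
print(new_tsns_allowed(4, [1, 1]))
```

When the total number of outstanding TSNs reaches the advertised window, the function returns 0 regardless of how much cwnd is available on any path, which is exactly the rbuf blocking condition.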

In the example, both forward and reverse paths between A1 and B1 fail just
after TSN 2 enters the network. Hence, TSN 2 and the SACK for TSN 1 are presumed
lost. TSNs 3 and 4 arrive out of order, each triggers a SACK, and both are stored in the
receive buffer. The CMT sender uses the Cwnd Update for CMT (CUC) algorithm
[Iyengar 2006] to decouple a path’s cwnd evolution from data ordering. On receiving
the SACK triggered by TSN 3, the sender uses CUC to increment C2 to 3, and
decrement O1 and O2 to 1. The available receive buffer space for new data is calculated
as advertised receive window (4) - total outstanding TSNs in the association (2). This
available receive buffer space allows the sender to transmit two TSNs, 5 and 6, on path
2. On path 1, even though 1 MTU worth of new data can be transmitted (C1 > O1),
rbuf blocking, i.e., flow control, stalls data transmission.

¹ Advertised receiver window (a_rwnd) has different connotations in TCP and SCTP.
TCP’s a_rwnd denotes the available memory in rbuf, starting from the left edge of the
received sequence space [RFC793]. SCTP’s a_rwnd denotes the available memory
after considering all TPDUs not yet delivered to the application layer, including the
out-of-order TPDUs [RFC4960].


Figure  4.2:  Rbuf  Blocking  in  CMT  during  Failure 


On receiving the SACK triggered by TSN 4, the sender increases C2 to 4,
and decreases O2 to 2. Lack of receive buffer space (advertised receive window - total
number of outstanding TSNs) continues to prevent transmission of new data on path 2.
Since O2 < C2, the SACKs triggered by TSNs 5 and 6 do not increment C2 [RFC4960]
(discussed later), but these SACKs do decrement O2. Even though O2 < C2, rbuf blocking
stalls data transmission on path 2.

Path 1’s retransmission timer expires and the sender detects the loss of
TSN 2. Note that this timeout is the first of the 6 (PMR = 5) consecutive timeouts
needed to detect path 1 failure. After this timeout, C1 is set to 1, O1 is set to 0, and path
1’s RTO value is doubled [RFC4960]. The CMT sender employs the
RTX_SSTHRESH policy and retransmits TSN 2 on path 2. Data cannot be transmitted
on path 1 due to rbuf blocking.

On receiving TSN 2, the receiver delivers data from TSNs 2-6 to the
application. The corresponding SACK advertises a receive window of 5 TSNs, and
concludes the current rbuf blocking instance. The sender now transmits TSN 7 on path
1, and TSNs 8-11 on path 2. Due to path 1 failure, TSN 7 is lost, and TSNs 8-11 are
received out-of-order and stored in the receiver’s buffer. The SACK triggered by TSN
8 increments C2 to 5 and decrements O2 to 3. The available receive buffer space for
new data is 0, triggering another instance of rbuf blocking, which stalls data transmission
until TSN 7 is successfully retransmitted. Note that the loss of TSN 7 can be recovered
only after a timeout on path 1, and due to the exponential backoff algorithm, path 1’s
current RTO value is twice the previous value.

To generalize, sender A transmits new data on path 1 until (PMR + 1)
consecutive timeouts mark path 1 as failed. During failure detection, data
transmitted on non-failed path(s) arrives out-of-order, resulting in consecutive rbuf
blocking instances. Each rbuf blocking instance concludes when the sender retransmits
lost TPDUs after an RTO. The length of an rbuf blocking instance is therefore
proportional to the failed path’s RTO. Also, each rbuf blocking instance is
exponentially longer than the previous instance due to the exponential backoff of RTO
values.

Rbuf  blocking  results  in  the  following  side-effects  that  further  degrade 
CMT’s  throughput: 

Preventing congestion window growth: Note that rbuf blocking prevents
the sender from fully utilizing the cwnd. When the amount of outstanding data is less
than the cwnd, RFC4960 prevents the sender from increasing the cwnd for future
SACKs. For example, in Figure 4.2, when the SACKs for TSNs 5, 6, and 9-11 arrive,
the sender cannot increment C2.

Reducing congestion window: To reduce burstiness in data transmission,
an SCTP sender employs a congestion window validation algorithm similar to
[RFC2861]. During every transmission, the sender uses the MaxBurst parameter
(recommended value of 4) as follows:

    If ((outstanding + MaxBurst * MTU) < Cwnd)
        Cwnd = outstanding + MaxBurst * MTU

This algorithm reduces the cwnd during idle periods so that at the next
sending opportunity, the sender cannot transmit more than (MaxBurst * MTU) bytes of
data. During rbuf blocking, the amount of outstanding data can become smaller than
the cwnd. In such cases, the above rule is triggered and further reduces the cwnd. In
Figure 4.2, when the SACK triggered by TSN 11 arrives at the sender, O2 decrements
to 0. The window validation algorithm causes C2 to be reduced to 4 (O2 (0) +
MaxBurst (4)).
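The two side-effects above can be illustrated with a small Python sketch. It is a simplification made under stated assumptions: the MTU value of 1500 bytes is chosen only for illustration, the "fully utilized" test is reduced to a comparison of outstanding bytes against cwnd, and both function names are hypothetical rather than taken from any SCTP implementation.

```python
MTU = 1500  # bytes; assumed value, for illustration only


def may_grow_cwnd(outstanding, cwnd):
    """Simplified check: cwnd may grow only when it is fully utilized."""
    return outstanding >= cwnd


def apply_max_burst(outstanding, cwnd, max_burst=4, mtu=MTU):
    """Cap cwnd so that at most max_burst MTUs can be sent in one burst."""
    limit = outstanding + max_burst * mtu
    return min(cwnd, limit)


# Path 2 in Figure 4.2: O2 drops to 0 while C2 is 5 MTUs.
reduced = apply_max_burst(outstanding=0, cwnd=5 * MTU)
print(reduced // MTU)  # cwnd in MTUs after the MaxBurst rule
```

In the example, the rule shrinks C2 from 5 MTUs to 4 MTUs, matching the value given in the text. A sender that is rbuf blocked with 2 MTUs outstanding against a 4 MTU cwnd also fails the `may_grow_cwnd` check, so incoming SACKs do not enlarge its window.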
4.3 CMT with Potentially Failed Destination State

[Caro  2005]  recommends  lowering  the  value  of  PMR  for  SCTP  flows  in 
Internet-like environments. Correspondingly, lowering the PMR for CMT flows
reduces  the  number  of  rbuf  blocking  episodes  during  failure  detection.  However, 
lowering  the  PMR  is  an  incomplete  solution  to  the  problem  since  a  CMT  flow  is  rbuf 
blocked  for  any  PMR  >  0  (discussed  later).  Also,  a  tradeoff  exists  on  deciding  the  value 
of  PMR  -  a  lower  value  reduces  rbuf  blocking  but  increases  the  chances  of  spurious 
failure  detection,  whereas  a  higher  PMR  increases  rbuf  blocking  and  reduces  spurious 
failure  detection  in  a  wide  range  of  environments. 

4.3.1  Details  of  CMT-PF 

To  mitigate  the  recurring  instances  of  rbuf  blocking  during  path  failures, 
our  proposed  solution  introduces  a  new  destination  state  called  “potentially-failed”  in 
the  FSM  of  Figure  4.1.  The  new  FSM,  shown  in  Figure  4.3,  is  based  on  the  rationale 
that  loss  detected  by  a  timeout  implies  either  severe  congestion  or  failure  en  route. 
After  a  single  timeout  on  a  path,  a  sender  is  unsure,  and  marks  the  corresponding 
destination  as  “potentially-failed”  (PF).  A  PF  destination  is  not  used  for  data 
transmission  or  retransmission.  CMT’s  retransmission  policies  are  augmented  to 
include  the  PF  state.  CMT  with  the  new  set  of  retransmission  policies  is  called  CMT- 
PF  [Natarajan  2006b].  Details  of  CMT-PF  are: 

• If a TPDU loss is detected by RFC4960’s threshold number of missing
reports, one of CMT’s current retransmission policies, such as
RTX_SSTHRESH, is used to select an active destination for “fast”
retransmissions.

• If a TPDU loss is detected by a timeout, the corresponding destination
transitions to the PF state (Figure 4.3). The sender does not transmit data
to a PF destination. However, when all destinations are in the PF state, the
sender transmits data to the destination with the least number of
consecutive timeouts. In case of a tie, data is sent to the last active
destination. This exception ensures that CMT-PF does not perform worse
than CMT when all paths have potentially failed (discussed further in
Section 4.6).


Figure 4.3: CMT-PF Failure Detection (PMR > 0)

• Heartbeats are sent to PF destination(s) with an exponential backoff of
RTO after every timeout until either (i) a heartbeat ack transitions the
destination back to the active state, or (ii) an additional PMR consecutive
timeouts confirm the path failure, upon which the destination transitions to
the failed state, and heartbeats are sent with a lower frequency as described
in RFC4960.

• Once a heartbeat ack indicates a PF destination is alive, the destination’s
cwnd is set to either 1 MTU (CMT-PF1), or 2 MTUs (CMT-PF2), and the
sender follows the slow start algorithm to transmit data to this destination.
Detailed analysis on the cwnd evolution of CMT-PF1 vs. CMT-PF2 can be
found in Section 4.6.

• Acks for retransmissions do not transition a PF destination to the active
state, since a sender cannot determine whether the ack was for the original
transmission or the retransmission(s).
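The rules listed above can be summarized as a small state machine. The Python sketch below is one possible reading of those rules, not the ns-2 module's implementation: the class, method, and function names are hypothetical, cwnd is tracked in MTUs, a heartbeat ack received in the failed state is assumed to leave cwnd unchanged, and treating "no active destination" the same as "all destinations PF" is a simplifying assumption.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    PF = "potentially-failed"
    FAILED = "failed"


class Destination:
    """Failure-detection state of one destination at a CMT-PF sender."""

    def __init__(self, name, pmr=5, pf_cwnd_mtus=2):
        self.name = name
        self.pmr = pmr                    # Path.Max.Retrans
        self.pf_cwnd_mtus = pf_cwnd_mtus  # 1 for CMT-PF1, 2 for CMT-PF2
        self.state = State.ACTIVE
        self.timeouts = 0                 # consecutive timeouts so far
        self.cwnd_mtus = 2                # initial cwnd used in the example

    def on_timeout(self):
        self.timeouts += 1
        self.cwnd_mtus = 1                # cwnd collapses to 1 MTU on timeout
        if self.timeouts > self.pmr:
            self.state = State.FAILED     # PMR + 1 consecutive timeouts
        else:
            self.state = State.PF         # only heartbeats from now on

    def on_heartbeat_ack(self):
        if self.state is State.PF:
            self.cwnd_mtus = self.pf_cwnd_mtus
        self.state = State.ACTIVE
        self.timeouts = 0

    def on_data_ack(self, for_retransmission=False):
        # Acks for retransmissions are ambiguous and are ignored here.
        if self.state is State.ACTIVE and not for_retransmission:
            self.timeouts = 0


def pick_data_destinations(dests, last_active=None):
    """Return the destinations that may currently carry data."""
    active = [d for d in dests if d.state is State.ACTIVE]
    if active:
        return active
    pf = [d for d in dests if d.state is State.PF]
    if not pf:
        return []
    # No active destination: fall back to the PF destination with the
    # fewest consecutive timeouts; ties go to the last active destination.
    fewest = min(d.timeouts for d in pf)
    tied = [d for d in pf if d.timeouts == fewest]
    chosen = last_active if last_active in tied else tied[0]
    chosen.state = State.ACTIVE
    return [chosen]
```

Note that with `pmr=0` the first timeout moves a destination straight to the failed state, so the PF state only has an effect for PMR > 0, consistent with the caption of Figure 4.3.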

4.3.2  CMT-PF  Data  Transfer  during  Failure 

Figure  4.4  depicts  an  analogous  CMT-PF  timeline  for  the  scenario 
described  in  Figure  4.2.  All  events  are  identical  between  the  two  figures  up  to  the  first 
timeout  on  path  1.  After  this  timeout,  the  CMT-PF  sender  transitions  path  1  to  the  PF 
state,  transmits  a  heartbeat  on  path  1,  and  retransmits  TSN  2  on  path  2.  The  heartbeat 
loss  on  the  failed  path  (path  1)  is  detected  on  the  next  timeout.  This  timeout  is  the 
second of (PMR + 1) consecutive timeouts required to detect path 1 failure.
Meanwhile,  receiver  buffer  space  is  released  once  the  retransmitted  TSN  2  is  received 
on  path  2.  From  this  point  onwards,  data  is  transmitted  only  on  path  2,  without  further 
rbuf  blocking. 

4.4  CMT  vs.  CMT-PF  Evaluations  during  Failure 

CMT-PF  was  implemented  in  the  University  of  Delaware’s  SCTP/CMT 
module  for  the  ns-2  network  simulator  [NS-2,  Ekiz  2007].  This  section  discusses  the 
performance  of  CMT  vs.  CMT-PF  during  permanent  and  short-term  failure  scenarios. 
Figure 4.4: CMT-PF Reduces Rbuf Blocking during Failure


In  the  simulation  topology  (Figure  4.5),  the  multihomed  sender,  A,  has  two 
independent  paths  to  the  multihomed  receiver,  B.  The  edge  links  between  A  (or  B)  to 
the  routers  represent  last-hop  link  characteristics.  The  end-to-end  one-way  delay  is 
45ms on both paths, representing typical coast-to-coast delays experienced by a
significant fraction of the flows in the Internet [Shakkottai 2004]. We note that the final
conclusions  regarding  CMT  vs.  CMT-PF  are  independent  of  the  actual  bandwidth  and 
delay  configurations  used  in  the  topology,  as  long  as  these  configurations  are  similar  on 
both  paths. 
Figure 4.5: Topology for Failure Experiments

The sender A transfers an 8MB file to receiver B using both path 1 and
path 2. Path 2 fails during the file transfer; this failure is simulated by bringing down the
bidirectional link between routers R20 and R21. Unless stated otherwise, PMR=5,
rbuf=64KB, and both paths experience Bernoulli losses with a low loss rate (1%). We
acknowledge that the Bernoulli loss model is less realistic than the nature of losses
observed in the Internet. Since evaluations in this section assume failure scenarios and
rare loss events (1% or no loss), we expect the final conclusions regarding CMT vs.
CMT-PF to remain similar even with a more realistic loss model.

4.4.1  Evaluations  during  Permanent  Failure 

In  the  following  experiments,  path  2  fails  permanently  5  seconds  after  the 
file  transfer  begins. 
4.4.1.1 Evaluations during Single Permanent Failure (without Congestion)

This  experiment  highlights  the  essential  differences  between  CMT  and 
CMT-PF  during  a  permanent  path  failure.  To  eliminate  the  influence  of  congestion- 
induced rbuf blocking, the simulation is set up such that the sender does not experience
any congestion losses on either path.


The  path  2  failure  causes  back-to-back  timeouts  at  the  sender.  Both 
senders  (CMT  and  CMT-PF)  experience  the  first  timeout  on  path  2  at  ~6  seconds,  and 
detect  the  failure  after  6  back-to-back  timeouts  (PMR=5),  at  ~69  seconds  (Figure  4.6). 
During  the  failure  detection  period,  CMT  continues  to  transmit  data  on  path  2, 
experiencing  consecutive  timeouts  and  recurring  rbuf  blocking  instances,  while  CMT- 
PF  does  not.  CMT’s  throughput  suffers  until  69  seconds  (until  failure  detection),  after 
which  CMT  uses  path  1  alone  and  completes  the  file  transfer  at  around  80  seconds.  On 
the  other  hand,  CMT-PF  transitions  path  2  to  PF  state  after  the  first  timeout,  and 
transmits  only  heartbeats  on  path  2  avoiding  further  rbuf  blocking.  Reduced  rbuf 
blocking helps CMT-PF to complete the file transfer (~15 seconds) using path 1 alone,
even  before  path  2  failure  is  detected. 

4.4.1.2  Evaluations  during  Varying  Failure  Detection  Thresholds  (PMR  Values) 

To achieve faster yet robust failure detection, [Caro 2005] argues for
varying  the  PMR  based  on  a  network’s  loss  rate,  and  suggests  PMR=3  for  the  Internet. 
Since  the  sender  detects  path  failure  after  (PMR+1)  consecutive  timeouts,  CMT’s 
failure-induced  rbuf  blocking  varies  as  the  PMR  varies.  Let, 

Tf = time when path 2 fails, and

Td = time when the sender detects path 2 failure (after PMR + 1 consecutive timeouts).

The goodput during failure detection (G) is defined as,

G = (application data received between Tf and Td) / (Td - Tf).

Figure  4.7  plots  CMT  vs.  CMT-PF  average  goodput  (G)  (in  KB/second) 
with  5%  error  margin,  for  varying  PMR  values.  The  dashed  line  in  Figure  4.7  denotes 
the maximum attainable goodput of an SCTP file transfer (application data received /
transfer time) using path 1 alone.

When  the  failure  detection  threshold  is  most  aggressive  (PMR=0),  both 
CMT  and  CMT-PF  detect  path  2  failure  after  the  first  timeout.  The  senders  experience 
similar  rbuf  blocking  during  this  failure  detection  period  and  perform  similarly  (Figure 
4.7).  As  PMR  increases,  the  number  of  rbuf  blocking  instances  during  failure  detection 
increases, resulting in increasing performance benefits from CMT-PF. As seen in Figure
4.7, as PMR and the failure detection duration increase, CMT-PF’s goodput increases,
whereas CMT’s goodput decreases. Starting from PMR=3, CMT-PF’s goodput is
comparable or equal to the maximum attainable SCTP goodput. To conclude, during
permanent failure, CMT-PF performs as well as CMT for PMR=0, and better than
CMT for PMR > 0.


[x-axis: Path Maximum Retransmit (PMR)]

Figure 4.7: CMT vs. CMT-PF under Varying PMR Values


4.4.2  Evaluations  during  Short-term  Failure 

In  the  following  experiments,  path  2  (Figure  4.5)  fails  temporarily  during 
the  file  transfer  between  A  and  B.  The  link  between  routers  R20  and  R21  goes  down 
after  5  seconds  from  the  start  of  file  transfer,  and  is  restored  5  seconds  later. 

4.4.2.1 Evaluations during Single Short-term Failure (without Congestion)

This experiment highlights how CMT and CMT-PF differ during a short-term
failure. Neither path experiences any congestion loss. The short-term failure is
long enough for the sender (CMT or CMT-PF) to experience three back-to-back
timeouts on path 2. As in the permanent failure case, CMT transmits data on path 2
after each of these timeouts, while CMT-PF does not. Therefore, CMT suffers from
consecutive rbuf blocking and lower throughput than CMT-PF (Figure 4.8). Once path 2
recovers at 10 seconds, CMT’s data and CMT-PF’s heartbeat transmissions on the path
(after the 3rd timeout, at ~12.5 seconds) are successful, and both CMT and CMT-PF
complete the file transfer without further rbuf blocking.


4.4.2.2  Evaluations  during  Varying  Receive  Buffer  Sizes 

This  second  short-term  failure  experiment  analyzes  CMT  vs.  CMT-PF 
under  varying  levels  of  receive  buffer  constraints  (receive  buffer  sizes).  Let, 

Tf  =  time  when  path  2  fails,  and 

Tr  =  time  when  path  2  is  restored. 

The  goodput  during  the  short-term  failure  (G)  is  defined  as, 

G = (application data received between Tf and Tr) / (Tr - Tf).

Figure  4.9  plots  CMT  vs.  CMT-PF  average  goodput  (G)  (in  KB/second), 
with 5% error margin. As the receive buffer becomes more constrained, i.e., as rbuf
size decreases, the chances of rbuf blocking increase. Consequently, CMT-PF’s ability
to alleviate rbuf blocking is more valuable at smaller rbuf sizes, and CMT-PF performs
increasingly better than CMT as rbuf size decreases.


[x-axis: Receive Buffer Size (16K, 32K, 64K, 128K, 256K, 512K)]

Figure 4.9: CMT vs. CMT-PF under Varying Rbuf Sizes


4.5  CMT  vs.  CMT-PF  Evaluations  during  Congestion 

The  evaluations  in  the  previous  section  confirmed  that  transitioning  a 
destination  to  the  PF  state  and  avoiding  data  transmission  on  the  PF  path  alleviates 
failure-induced  rbuf  blocking.  During  permanent  and  short-term  failure  scenarios, 
CMT-PF performed similarly to or better than CMT, and never worse. We now investigate
how  the  PF  state  transition  fares  when  timeouts  are  caused  by  non-failure  scenarios 
such  as  congestion  [Natarajan  2008b]. 

Consider the case when a timeout on a path, say p, is due to congestion
rather  than  failure.  Depending  on  the  rbuf  size  and  the  different  paths’  characteristics, 
the  transport  sender  may  or  may  not  be  rbuf  blocked  until  and/or  after  the  timeout 
expiration,  leading  to  the  following  two  scenarios: 




Sender  is  limited  by  rbuf:  Both  CMT  and  CMT-PF  senders  cannot  transmit 
new  data  until  the  rbuf  blocking  is  cleared,  i.e.,  until  after  successful  retransmission(s) 
of  lost  data.  The  only  difference  is  that  CMT  considers  p  for  retransmissions,  whereas 
CMT-PF  transmits  a  heartbeat  on  p,  and  tries  to  retransmit  lost  data  on  other  active 
paths. (If all destinations are in the PF state, the CMT-PF sender transitions the
destination  with  the  least  number  of  consecutive  timeouts  to  the  active  state  (Section 
4.3),  and  retransmits  lost  data  to  this  new  active  destination.) 

Sender  is  not  limited  by  rbuf:  Assume  that  SCTP  PDUs  (data  or 
heartbeats)  transmitted  after  the  first  timeout  on  path  p  successfully  reach  the  receiver. 
In  CMT,  the  cwnd  allows  1  MTU  worth  of  new  data  transmission  on  p  (Figure  4.10), 
and  the  corresponding  SACK  increments  path  p’s  cwnd  by  1  MTU.  At  the  end  of  1 
RTT  after  the  timeout  (shown  by  point  A  in  Figure  4.10),  (i)  the  cwnd  on  p=2  MTUs, 
and  (ii)  1  MTU  worth  of  new  data  has  been  successfully  sent  on  p. 

CMT-PF  transmits  a  heartbeat  on  p  and  new  data  on  other  active  path(s). 
(Note:  if  all  destinations  are  marked  PF,  the  CMT-PF  sender  transitions  a  PF 
destination  to  the  active  state.)  Path  p  is  marked  active  when  the  heartbeat  ack  reaches 
the sender. Therefore, after 1 RTT from the timeout (shown by point B in Figure 4.11),
(i) cwnd on p = 1 MTU (CMT-PF1), and (ii) no new data has been sent on p.
Comparing points A and B in Figures 4.10 and 4.11, respectively, it can be seen that
CMT has a 1 RTT “lead” in path p’s cwnd growth. Assuming no further losses on p,
after n RTTs, the cwnd on p will be 2^n with CMT, and 2^(n-1) with CMT-PF1.
Figure 4.10: CMT Data Transfer during no Rbuf Blocking

Figure 4.11: CMT-PF1 Data Transfer during no Rbuf Blocking

Figure 4.12: CMT-PF2 Data Transfer during no Rbuf Blocking

To avoid the 1 RTT lag in CMT-PF1’s cwnd evolution, we propose CMT-PF2,
which initializes path p’s cwnd to 2 MTUs after receiving a heartbeat ack (shown
by  point  C  in  Figure  4.12).  Assuming  that  today’s  Internet  router  queues  deal  with 
packets  rather  than  bytes,  the  successful  routing  of  a  heartbeat  PDU  is  equivalent  to  the 
successful  routing  of  a  data  PDU.  Hence,  a  heartbeat  ack  can  be  used  to  clock  the 
transport  layer  sender  in  the  same  way  as  a  data  ack.  In  the  following  sections,  any 
reference  to  CMT-PF  implies  CMT-PF2. 
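The cwnd evolution of the three variants after a congestion-induced timeout can be compared with a short sketch. It assumes idealized slow start (cwnd doubles every RTT), no further losses on path p, and that the first packet sent after the timeout (data or heartbeat) is acknowledged. The function name is hypothetical, and the CMT-PF2 expression is derived here from the 2 MTU initialization rather than quoted from the text.

```python
def cwnd_after_timeout(variant, n_rtts):
    """cwnd (in MTUs) on path p, n_rtts round trips after a timeout."""
    if n_rtts < 1:
        raise ValueError("n_rtts must be at least 1")
    if variant == "CMT":
        # cwnd is 1 MTU after the timeout; the first data ack doubles it.
        return 2 ** n_rtts
    if variant == "CMT-PF1":
        # The first RTT carries a heartbeat; cwnd restarts at 1 MTU.
        return 1 * 2 ** (n_rtts - 1)
    if variant == "CMT-PF2":
        # The first RTT carries a heartbeat; cwnd restarts at 2 MTUs.
        return 2 * 2 ** (n_rtts - 1)
    raise ValueError("unknown variant: " + variant)


for n in (1, 2, 3):
    print(n, [cwnd_after_timeout(v, n) for v in ("CMT", "CMT-PF1", "CMT-PF2")])
```

Under these assumptions CMT-PF2 matches CMT's cwnd at every RTT, while CMT-PF1 stays one RTT behind.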

4.5.1  Simulation  Setup 

The simulation evaluations consider a dual-dumbbell topology with a more
realistic loss model, as shown in Figure 4.13. Each router, R, is attached to five edge
nodes. Dual-homed edge nodes A and B are the transport (CMT or CMT-PF) sender
and receiver, respectively. The other edge nodes are single-homed, and introduce
cross-traffic that instigates bursty periods of congestion and bursty congestion losses at
the routers. Their last-hop propagation delays are randomly chosen from a uniform
distribution between 5-20 ms, resulting in end-to-end one-way propagation delays
ranging ~35-65ms [Shakkottai 2004]. All links (both edge and core) have a buffer size
twice the link's bandwidth-delay product, which is a reasonable setting in practice.


Each  single-homed  edge  node  has  eight  traffic  generators,  introducing 
cross-traffic  with  a  Pareto  distribution.  The  cross-traffic  packet  sizes  are  chosen  to 
resemble  the  distribution  found  on  the  Internet:  50%  are  44B,  25%  are  576B,  and  25% 
are  1500B  [CAIDA,  Fraleigh  2003].  The  cross-traffic  flows  start  at  random  times 
during  the  initial  5  seconds  of  the  simulation.  After  an  initial  warm-up  period  of  10 
seconds,  sender  A  transmits  a  32MB  file  to  receiver  B  over  paths  1  and  2.  The  result  is 
a data transfer from A to B, over a network with self-similar cross-traffic, which
resembles  the  observed  nature  of  traffic  on  data  networks  [Leland  1993]. 
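As a rough illustration of the cross-traffic packet size mix described above, the following Python sketch draws packet sizes with the stated probabilities and computes the resulting mean size. It models only the size distribution; the Pareto-distributed traffic generators and their timing are not reproduced, and all names are hypothetical.

```python
import random

PACKET_SIZES = [44, 576, 1500]      # bytes
PROBABILITIES = [0.50, 0.25, 0.25]  # share of packets with each size


def mean_packet_size():
    """Expected cross-traffic packet size in bytes."""
    return sum(size * p for size, p in zip(PACKET_SIZES, PROBABILITIES))


def sample_packet_sizes(count, seed=None):
    """Draw `count` packet sizes according to the distribution above."""
    rng = random.Random(seed)
    return rng.choices(PACKET_SIZES, weights=PROBABILITIES, k=count)


print(mean_packet_size())  # 22 + 144 + 375 bytes
```

The mean of 541 bytes shows that, although half of the packets are small, most of the cross-traffic bytes are carried by the 1500B packets.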
For both CMT and CMT-PF flows, rbuf=64KB, PMR=5, and loss rates are
controlled by varying the cross-traffic load. The graphs in the subsequent discussions
plot the average goodput (file size / transfer time) of CMT vs. CMT-PF with 5% error
margin.

4.5.2 Evaluations during Symmetric Loss Conditions

In the symmetric loss case, the aggregate cross-traffic loads on both paths
are similar and vary from 40%-100% of the core link’s bandwidth.

4.5.2.1 Evaluations during Symmetric Path Delays

Both CMT and CMT-PF perform similarly (Figure 4.14) during low loss
rates (i.e., low cross-traffic), since most of the TPDU losses are recovered via fast
retransmits as opposed to timeout recoveries.

Figure 4.14: CMT vs. CMT-PF during Symmetric Loss and RTT Conditions

As the cross-traffic load and loss rate increase, the number of timeouts on each path
increases. Under such conditions, the probability that both paths are simultaneously
marked “potentially-failed” increases in CMT-PF. To ensure that CMT-PF does not
perform worse when all destinations are marked PF, CMT-PF transitions the
destination with the smallest number of consecutive timeouts to the active state,
allowing data to be sent to that destination (refer to Section 4.3). This modification
guarantees that CMT-PF performs on par with CMT even when both paths experience
high loss rates (Figure 4.14).

4.5.2.2  Evaluations  during  Asymmetric  Path  Delays 

Under symmetric loss conditions, we now study how a path’s RTT affects
the throughput differences between CMT and CMT-PF. Note that any difference
between CMT and CMT-PF transpires only after a timeout on a path. Assume that a
path experiences a timeout event, and the next TPDU loss on the path takes place after
n RTTs. After the timeout, CMT slow starts on the path, and the number of TPDUs
transmitted on the path at the end of n RTTs = 1 + 2 + 4 + ... + 2^n = (2^(n+1) - 1).
CMT-PF uses the first RTT for a heartbeat transmission, and slow starts with initial
cwnd=2 after receiving the heartbeat ack. In CMT-PF, the number of TPDUs
transmitted by the end of n RTTs on the path = 0 + 2 + 4 + ... + 2^n = (2^(n+1) - 2). Thus,
after n RTTs, CMT transmits 1 TPDU more than CMT-PF, and the 1 TPDU difference
is unaffected by the path’s RTT. Therefore, when paths experience symmetric RTTs
(a.k.a. symmetric RTT conditions), we expect the performance ratio between CMT and
CMT-PF to remain unaffected by the RTT value.
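The two series above can be checked numerically. The sketch below simply evaluates the sums as written in the text, with the first term being 1 TPDU for CMT and 0 TPDUs for CMT-PF (whose first RTT carries only a heartbeat); the function name is hypothetical.

```python
def tpdus_sent(first_rtt_tpdus, n):
    """Sum of the series first_rtt_tpdus + 2 + 4 + ... + 2**n."""
    return first_rtt_tpdus + sum(2 ** k for k in range(1, n + 1))


for n in range(1, 6):
    cmt = tpdus_sent(1, n)      # equals 2**(n + 1) - 1
    cmt_pf = tpdus_sent(0, n)   # equals 2**(n + 1) - 2
    print(n, cmt, cmt_pf, cmt - cmt_pf)
```

For every n the difference is exactly 1 TPDU, independent of how long an RTT lasts, which is the basis for expecting RTT-independent performance ratios under symmetric RTT conditions.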

We now consider a more interesting scenario when the independent
end-to-end paths experience symmetric loss rates, but asymmetric RTT conditions. That is,
path 1’s RTT = x sec, and path 2’s RTT = y sec (x ≠ y). How do x and y impact CMT vs.
CMT-PF performance? More importantly, does CMT-PF perform worse when the
paths have asymmetric RTTs?

Using the topology in Figure 4.5, we performed the following Bernoulli loss
model experiment to gain insight. The Bernoulli loss model simulations, while less
realistic, take much less time than cross-traffic ones, and initial investigations revealed
that both loss models resulted in similar trends between CMT and CMT-PF. Path 1’s
one-way propagation delay was fixed at 45ms while path 2’s one-way delay varied as
follows: 45ms, 90ms, 180ms, 360ms, and 450ms. Both paths experience identical loss
rates ranging from 1%-10%.
[x-axis: Path 1 and Path 2 Loss Rates (%)]

Figure 4.15: CMT vs. CMT-PF Goodput Ratios during Symmetric Loss and Asymmetric RTT Conditions


Figure  4.15  plots  the  ratio  of  CMT’s  goodput  over  CMT-PF’s  (relative 
performance  difference)  with  5%  error  margin.  As  expected,  both  CMT  and  CMT-PF 
perform  equally  well  during  symmetric  RTT  conditions.  As  the  asymmetry  in  paths’ 
RTTs increases, an interesting dynamic dominates and CMT-PF performs slightly
better  than  CMT  (goodput  ratios  <  1). 

Further  investigation  revealed  the  following  about  CMT  vs.  CMT-PF  rbuf 
blocking durations, shown in Figure 4.16. For each combination of path 2’s delay and
loss rate, Figure 4.16 plots the ratio of rbuf blocked durations (CMT over CMT-PF)
during timeout recoveries. As path 2’s one-way delay and loss rate increase, the ratio
becomes increasingly greater than 1, signifying that a CMT sender suffers longer rbuf
blocking durations than CMT-PF.


Figure  4.16:  CMT  vs.  CMT-PF  Rbuf  Blocking  Durations 


Note  that  rbuf  blocking  depends  on  the  frequency  of  loss  events  (loss  rate), 
and  the  duration  of  loss  recovery.  As  loss  rate  increases,  the  probability  that  a  sender 
experiences  consecutive  timeout  events  on  the  path  increases.  After  the  first  timeout, 
CMT-PF  transitions  the  path  to  PF,  and  avoids  data  transmission  on  the  path  (as  long 
as another active path exists) until a heartbeat-ack confirms the path as active. A
CMT sender, however, suffers back-to-back timeouts on data sent on the path, with exponential
backoff of the timeout recovery period. As path 2’s RTT increases, path 2’s RTO
increases, and the back-to-back timeouts on data result in longer rbuf blocking
durations in CMT than in CMT-PF. Therefore, as path 2’s RTT increases, CMT’s
goodput degrades more than CMT-PF’s, and the goodput ratio decreases (Figure
4.15).
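
A back-of-the-envelope model illustrates why the blocking gap widens with path 2's RTT. The simplifying assumptions are ours and are not part of the simulations: rbuf blocking lasts for the whole timeout recovery period, the RTO doubles after each timeout with no upper cap, and CMT-PF is blocked only during the first timeout because later timeouts on the PF path affect heartbeats rather than data. For k consecutive timeouts on path 2, starting from a timeout value RTO:

```latex
\begin{align*}
T_{\mathrm{CMT}}(k) &= \sum_{i=0}^{k-1} 2^{i}\,\mathrm{RTO} = (2^{k}-1)\,\mathrm{RTO}, \\
T_{\mathrm{CMT\text{-}PF}}(k) &\approx \mathrm{RTO}, \\
T_{\mathrm{CMT}}(k) - T_{\mathrm{CMT\text{-}PF}}(k) &\approx (2^{k}-2)\,\mathrm{RTO}.
\end{align*}
```

Under these assumptions the two variants are blocked equally long for k = 1, while for k ≥ 2 the extra blocking time in CMT scales linearly with the RTO (and therefore with path 2's RTT) and exponentially with the number of consecutive timeouts, which become more likely at higher loss rates.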

In  summary,  during  symmetric  loss  conditions,  CMT  and  CMT-PF 
perform  equally  well  when  paths  experience  symmetric  RTT  conditions.  As  the  RTT 
asymmetry  increases,  CMT-PF  demonstrates  a  slight  advantage  at  higher  loss  rates. 


4.5.3  Evaluations  during  Asymmetric  Loss  Conditions 

In  the  asymmetric  loss  experiment,  paths  1  and  2  experience  different 
cross-traffic  loads.  The  aggregate  cross-traffic  load  on  path  1  is  set  to  50%  of  the  core 
link  bandwidth,  while  on  path  2  the  load  varies  from  50%-100%  of  the  core  link 
bandwidth. 


Path 2             Variant    Number of Consecutive Timeouts
Cross-traffic (%)             2       3       4       5
70                 CMT        0.49    0.02    0       0
70                 CMT-PF     0       0       0       0
80                 CMT        1.13    0.07    0       0
80                 CMT-PF     0       0       0       0
90                 CMT        3.73    0.60    0.09    0.02
90                 CMT-PF     0.02    0.02    0       0
100                CMT        9.42    1.62    0.18    0.04
100                CMT-PF     0.04    0.04    0       0

Table  4.1:  CMT  vs.  CMT-PF  Mean  Consecutive  Data  Timeouts  on  Path  2 

As  discussed  in  the  previous  sub-section,  as  path  2’s  cross-traffic  load 
increases,  the  probability  that  a  sender  experiences  back-to-back  timeouts  on  path  2 
increases.  CMT  suffers  a  higher  number  of  consecutive  timeouts  on  data  (Table  4.1) 
resulting  in  more  extended  rbuf  blocking  periods  when  compared  with  CMT-PF. 
Therefore, as path 2’s cross-traffic load increases, CMT-PF performs better than CMT
(Figure 4.18).

Figure 4.18: CMT vs. CMT-PF during Asymmetric Loss Conditions


Table  4.2:  CMT  vs.  CMT-PF  Mean  Number  of  Transmissions 


The  asymmetric  loss  experiment  also  helps  to  understand  the  following 
difference  in  CMT  vs.  CMT-PF’s  transmission  strategy.  In  CMT,  RTX_SSTHRESH  is 
a  retransmission  policy,  and  is  not  applied  to  new  data  transmissions.  In  CMT-PF,  a 
path  is  marked  PF  after  a  timeout,  and  as  long  as  active  path(s)  exist,  CMT-PF  avoids 
retransmissions on the PF path. Once the retransmissions are all sent, CMT-PF’s data
transmission strategy is applied to new data, and CMT-PF avoids new data
transmissions on the PF path. As shown in Table 4.2, when compared to CMT, CMT-PF
reduces the number of (re)transmissions on the higher loss rate path 2 and
(re)transmits  more  on  the  lower  loss  rate  path  1.  This  transmission  difference  (ratio  of 
transmissions  on  path  1  over  path  2)  between  CMT-PF  and  CMT  increases  as  the  paths 
become  more  asymmetric  in  their  loss  conditions. 
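
The transmission strategy described above can be sketched as a small path-selection routine. This is a minimal illustration, not the simulated or FreeBSD implementation: the `Path` class, the state names, and the fallback to PF paths when no active path exists are assumptions made for the example. Choosing the candidate with the largest ssthresh is our reading of the RTX_SSTHRESH retransmission policy.

```python
from dataclasses import dataclass

ACTIVE = "active"
PF = "potentially-failed"


@dataclass
class Path:
    name: str
    state: str = ACTIVE
    ssthresh: int = 65536  # bytes; illustrative default


def on_data_timeout(path):
    """CMT-PF: a timeout on an active path marks it potentially failed."""
    if path.state == ACTIVE:
        path.state = PF


def on_heartbeat_ack(path):
    """A heartbeat-ack confirms a PF path as active again."""
    if path.state == PF:
        path.state = ACTIVE


def eligible_paths(paths):
    """Paths usable for data: active ones if any exist, otherwise PF ones.

    The fallback to PF paths is an assumption of this sketch.
    """
    active = [p for p in paths if p.state == ACTIVE]
    return active if active else [p for p in paths if p.state == PF]


def pick_retransmission_path(paths):
    """Pick the eligible path with the largest ssthresh (RTX_SSTHRESH-like)."""
    candidates = eligible_paths(paths)
    return max(candidates, key=lambda p: p.ssthresh) if candidates else None
```

In this sketch plain CMT corresponds to never calling `on_data_timeout`, so a lossy path keeps receiving new data after a timeout, whereas CMT-PF steers both retransmissions and new data to the remaining active path.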

In  summary,  CMT-PF  does  not  perform  worse  than  CMT  during 
asymmetric  path  loss  conditions.  In  fact,  CMT-PF  is  a  better  transmission  strategy 
than  CMT,  and  performs  better  as  the  asymmetry  in  path  loss  increases. 

4.6  Conclusion,  Ongoing  and  Related  Work 

Using  simulations,  we  demonstrated  that  retransmission  policies  using 
CMT with a “potentially-failed” destination state (CMT-PF) outperform CMT during
permanent  and  short-term  failures.  During  permanent  failures,  CMT-PF  employs  a 
better  failure  detection  process  than  CMT  even  under  aggressive  failure  detection 
thresholds.  Investigations  during  symmetric  loss  conditions  revealed  that  CMT-PF 
performs  as  well  as  CMT  during  symmetric  path  RTTs,  and  slightly  better  when  the 
paths  experience  asymmetric  RTT  conditions.  Also,  CMT-PF  employs  a  better 
transmission  strategy  than  CMT  during  asymmetric  loss  conditions. 

Our  evaluations  conclude  that  CMT-PF  (i)  reduces  rbuf  blocking  during 
failure scenarios, and (ii) performs on par with or slightly better than CMT during
non-failure scenarios. Since our findings demonstrate that CMT-PF performs similar to or
better than, but never worse than, CMT, we recommend CMT be replaced by CMT-PF in existing
and  future  implementations  and  RFCs. 

4.6.1 CMT-PF Implementation in FreeBSD

Joeseph  Szymanski  extended  the  FreeBSD  CMT  implementation  to  include 
CMT-PF.  The  following  emulation  experiments  were  performed  using  this  FreeBSD 
implementation. 


Figure  4.19:  Emulation  Topology  for  CMT  vs.  CMT-PF  Experiments 


The  experimental  topology,  shown  in  Figure  4.19,  consists  of  three  nodes 
running  FreeBSD  7  —  a  client  node,  a  server  node,  and  a  third  node  running  the 
Dummynet  traffic  shaper  [Rizzo  1997].  The  server  and  client  nodes  are  connected  by 
two  independent  paths,  with  symmetric  bandwidth  and  propagation  delay 
characteristics.  The  paths  also  experience  Bernoulli  losses,  with  loss  rates  varying  from 
0%-10%.  The  forward  and  reverse  queue  sizes  for  both  paths  are  set  to  1000KB.  The 
transport  layer  receive  window=64KB,  and  PMR=5.  At  time  t=0,  the  server  initiates  a 
bulk  file  transfer  to  the  client. 




4.6.1.1 Single Failure Scenario

To  validate  the  behavioral  differences  between  CMT  and  CMT-PF,  we 
emulated a single failure scenario, similar to the scenario described in Section 4.4.1.1.
Neither path experiences loss in this experiment. At time t=5, path 2 fails; this failure is
emulated by setting up appropriate Dummynet rules to block all packets traversing
path 2 to and from the client and server. Figure 4.20 plots the cumulative
bytes received at the client during this transfer.


Figure  4.20:  CMT  vs.  CMT-PF  during  Permanent  Path  Failure 

Figure  4.20  can  be  compared  with  the  corresponding  simulation  results 
shown in Figure 4.6. As observed in the simulations, path 2 failure causes consecutive
timeouts and rbuf blocking instances in CMT, which prevent data transmission until
failure detection (~69 seconds). After failure detection, CMT transmits data using only
path 1, and finishes the file transfer at ~80 seconds. The CMT-PF sender transitions path
2 to PF after the first timeout (~6.5 seconds), and transmits only heartbeats on path 2.
Data transmission continues on path 1 and the file transfer finishes at ~18 seconds.
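
The timeline above is consistent with simple arithmetic on exponential timeout backoff. The sketch below assumes that the first timeout on path 2 uses an RTO of 1 second and that the RTO is capped at 60 seconds (the RTO.Min and RTO.Max values recommended in RFC 4960); the RTO values actually in effect during the emulation are not stated in the text. It also assumes that a path is declared failed once the number of consecutive timeouts exceeds PMR, that is, on timeout number PMR + 1. The function name is hypothetical.

```python
def timeout_schedule(failure_time, first_rto, pmr, rto_max=60.0):
    """Return the times of consecutive timeouts up to failure detection.

    The RTO doubles after every timeout (capped at `rto_max`), and the
    path is declared failed on timeout number PMR + 1, so the last entry
    of the returned list is the failure detection time.
    """
    times = []
    now = failure_time
    rto = first_rto
    for _ in range(pmr + 1):
        now += rto
        times.append(now)
        rto = min(rto * 2, rto_max)
    return times


if __name__ == "__main__":
    # Path 2 fails at t=5 s; PMR=5 as in the emulation setup.
    print(timeout_schedule(failure_time=5.0, first_rto=1.0, pmr=5))
```

With these assumed values the timeouts fall at 6, 8, 12, 20, 36, and 68 seconds. The first value is close to the ~6.5 second point at which CMT-PF marks path 2 as PF, and the last is close to the ~69 second failure detection observed for CMT.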

4.6.1.2  Symmetric  Loss  Conditions 

This  experiment  is  designed  to  compare  CMT  vs.  CMT-PF  under  varying 
congestion levels. Similar to the scenario described in Section 4.5.2.1, paths 1 and 2
experience symmetric loss rates, varying from 1%-10%. Figure 4.21 plots the average
goodput (file size / transfer time) of CMT vs. CMT-PF with 5% error margin.


As  observed  in  the  simulations  (Figure  4.14),  both  CMT  and  CMT-PF 
perform similarly during low loss rates, since most of the TPDU losses are recovered
via fast retransmits as opposed to timeout recoveries. As loss rate increases, the
probability that both paths are simultaneously marked PF increases in CMT-PF. Unlike
the simulation results, CMT-PF performs slightly worse than CMT during such high
loss conditions. Further investigation exposed a few potential bugs in the CMT-PF
implementation. We are currently exploring these issues.

4.6.2  CMT-PF  Applicability  during  Mobile  Handovers 

Mobile  SCTP  (mSCTP)  [Koh  2004,  Koh  2005]  provides  transport  layer 
features such as multihoming and dynamic address reconfiguration [RFC5061] to
achieve seamless handover in the context of heterogeneous wireless access networks.
[Budzisz 2008] investigates the possibility of using CMT to increase throughput of an
mSCTP association during handover scenarios. Since path failures are common in a
wireless  network,  [Budzisz  2008]  proposes  to  employ  CMT-PF  instead  of  CMT. 

Simulation evaluations presented in [Budzisz 2008] show that, while CMT-PF’s
performance during handover is sensitive to various parameters, CMT-PF reduces
rbuf blocking and improves throughput for parameters typical of today’s heterogeneous
wireless  access  networks. 

Chapter 5

SUMMARY  AND  CONCLUSIONS 


This  dissertation  investigated  three  issues  related  to  the  transport  layer  and 
proposed  solutions  to  address  these  issues.  This  chapter  summarizes  our  contributions 
for  each  issue,  and  concludes  the  dissertation. 

5.1  Issue  (1):  Web  over  Multistreamed  Transport 

We examined HOL blocking, and its effects on web response times in
HTTP over TCP. Since a multistreamed transport such as SCTP eliminates inter-object
HOL blocking, we hypothesized that SCTP streams would improve web response
times. We designed and implemented HTTP over SCTP in the open source Apache
server and Firefox browser. Emulation evaluations showed that persistent and pipelined
HTTP 1.1 transfers over a single multistreamed SCTP association improve web
response times when compared to similar transfers over a single TCP connection. The
difference  in  TCP  vs.  SCTP  response  times  increases  and  is  more  visually  perceivable  in 
the  high  latency  and  lossy  browsing  conditions  found  in  the  developing  world. 

The  current  workaround  to  improve  an  end  user’s  perceived  WWW 
performance  is  to  download  an  HTTP  transfer  over  multiple  TCP  connections.  While 
we  expected  multiple  TCP  connections  to  improve  HTTP  throughput,  emulation 
results  showed  that  the  competing  and  bursty  nature  of  multiple  TCP  senders  degraded 
HTTP performance, especially in paths with low bandwidth last hops. In such browsing conditions,
a single multistreamed SCTP association not only eliminated HOL blocking, but also
boosted  throughput  compared  to  multiple  TCP  connections. 

Our  body  of  work  in  HTTP  over  SCTP  has  triggered  significant  interest  in 
the  area.  We  are  currently  working  with  the  IETF  to  standardize  our  HTTP  over  SCTP 
streams  design. 

5.2  Issue  (2):  Reneging  and  Selective  Acks 

We  investigated  how  the  existing  SACK  mechanism  degrades  end-to-end 
performance when out-of-order data is non-renegable. Using simulation, we showed that
SACKs  result  in  inevitable  send  buffer  wastage,  which  increases  as  the  frequency  of 
loss  events  and  loss  recovery  durations  increase.  We  introduced  a  fundamentally  new 
ack  mechanism,  Non-Renegable  Selective  Acknowledgments  (NR-SACKs),  for  SCTP. 
An  SCTP  receiver  used  NR-SACKs  to  explicitly  identify  some  or  all  out-of-order  data 
as  being  non-renegable,  allowing  the  sender  to  free  up  send  buffer  sooner  than  if  the 
data  were  only  SACKed.  Simulation  comparisons  showed  that  NR-SACKs  enabled  (i) 
efficient  utilization  of  a  transport  sender’s  memory,  and  (ii)  throughput  improvements 
in  CMT.  We  are  currently  working  with  the  IETF  to  standardize  NR-SACKs  for 
SCTP. 
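
The send buffer effect summarized above can be illustrated with a toy model of ack processing at the sender. This is a sketch under stated assumptions, not the NR-SACK chunk format or the simulated implementation: TSNs are plain integers with no wraparound, and the function and parameter names are hypothetical.

```python
def remaining_send_buffer(send_buffer, cum_ack, gap_acked, nr_gap_acked):
    """Return the TSNs the sender must still hold after one ack.

    - Data at or below `cum_ack` is cumulatively acked and can be freed.
    - Data in `nr_gap_acked` is out-of-order but reported non-renegable,
      so it can be freed as well.
    - Data in `gap_acked` is only SACKed; the receiver may still renege,
      so the sender has to keep a copy (hence the set is not consulted).
    """
    return {
        tsn for tsn in send_buffer
        if tsn > cum_ack and tsn not in nr_gap_acked
    }


if __name__ == "__main__":
    buffered = {1, 2, 3, 4, 5, 6}
    # TSN 3 is lost; 4 and 5 arrived out of order.
    print(sorted(remaining_send_buffer(buffered, 2, {4, 5}, set())))
    print(sorted(remaining_send_buffer(buffered, 2, set(), {4, 5})))
```

In the first call TSNs 4 and 5 are only SACKed and stay in the send buffer until the cumulative ack passes them; in the second call the same TSNs are NR-SACKed and their buffer space is released immediately.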

5.3  Issue  (3):  CMT  during  Path  Failures 

We  demonstrated  that  CMT  suffers  from  significant  throughput 
degradation  during  permanent  and  short-term  path  failures.  We  introduced  a  new 
destination  state  called  the  “Potentially  Failed”  (PF)  state.  CMT’s  failure  detection  and 
(re)transmission  policies  were  augmented  to  include  the  PF  state.  The  modified  CMT, 
called  CMT-PF,  outperformed  CMT  during  failures  -  even  under  aggressive  failure 
detection thresholds. During non-failure scenarios such as congestion, CMT-PF
performed  on  par  or  better  but  never  worse  than  CMT.  In  light  of  these  findings,  we 
recommend  CMT  be  replaced  by  CMT-PF  in  existing  and  future  CMT  implementations 
and  RFCs. 